From: Thomas Gleixner <tglx@linutronix.de>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: "Raj\, Ashok" <ashok.raj@intel.com>, "Tian\,
	Kevin" <kevin.tian@intel.com>, "Jiang\,
	Dave" <dave.jiang@intel.com>, Bjorn Helgaas <helgaas@kernel.org>,
	"vkoul\@kernel.org" <vkoul@kernel.org>, "Dey\,
	Megha" <megha.dey@intel.com>, "maz\@kernel.org" <maz@kernel.org>,
	"bhelgaas\@google.com" <bhelgaas@google.com>,
	"alex.williamson\@redhat.com" <alex.williamson@redhat.com>, "Pan\,
	Jacob jun" <jacob.jun.pan@intel.com>, "Liu\,
	Yi L" <yi.l.liu@intel.com>, "Lu\, Baolu" <baolu.lu@intel.com>,
	"Kumar\, Sanjay K" <sanjay.k.kumar@intel.com>, "Luck\,
	Tony" <tony.luck@intel.com>,
	"jing.lin\@intel.com" <jing.lin@intel.com>,
	"kwankhede\@nvidia.com" <kwankhede@nvidia.com>,
	"eric.auger\@redhat.com" <eric.auger@redhat.com>,
	"parav\@mellanox.com" <parav@mellanox.com>,
	"rafael\@kernel.org" <rafael@kernel.org>,
	"netanelg\@mellanox.com" <netanelg@mellanox.com>,
	"shahafs\@mellanox.com" <shahafs@mellanox.com>,
	"yan.y.zhao\@linux.intel.com" <yan.y.zhao@linux.intel.com>,
	"pbonzini\@redhat.com" <pbonzini@redhat.com>, "Ortiz\,
	Samuel" <samuel.ortiz@intel.com>, "Hossain\,
	Mona" <mona.hossain@intel.com>,
	"dmaengine\@vger.kernel.org" <dmaengine@vger.kernel.org>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pci\@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"kvm\@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
Date: Sun, 08 Nov 2020 19:47:24 +0100
Message-ID: <87pn4nk7nn.fsf@nanos.tec.linutronix.de>
In-Reply-To: <20201107001207.GA2620339@nvidia.com>

On Fri, Nov 06 2020 at 20:12, Jason Gunthorpe wrote:
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
>
> Earlier in this thread Kevin said hyper-v is already working this way,
> even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
> problem and it should not be solved by PCI capability flags.

I mostly agree but want to add a few clarifications about the
terminology and the boundaries because I think that is where a lot of
the confusion comes from.

Let me go back to the basic structure both at the hardware and at the
software level.

The basic structure is:

  [CPU] -- [Bridge] -- Bus -- [Device]

This applies to all kinds of buses where the bridge directly translates
into the CPU's address space. Now let's look at the boundaries:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |   
                |

The boundary is in the middle of the bridge because the CPU side of the
bridge is obviously CPU and therefore architecture specific. The Bus
side of the bridge is architecture agnostic.

Now let's add an IOMMU:

  [CPU] -- [IOMMU] -- [Bridge] -- Bus -- [Device]

and in theory the boundary moves now to:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
               |

because with an IOMMU the bridge could become CPU and architecture
agnostic. In reality this is not the case as the bridge is still the
same thing.

Now let's look at MSI. As established above, the Bus and the Device are
CPU and architecture agnostic and the Device merely uses a composed
message which is stored at some place accessible to the device to send
that message when it raises an interrupt. So where is this message
composed?

The basic case:

                   |
                   |
  [CPU]    -- [Bri | dge] -- Bus -- [Device]
                   |
  Alloc +           
  Compose                   Store     Use

The Bridge is irrelevant here as it is only involved in the
transport. In the view of the interrupt subsystem the Bridge is nothing
more than transport.

The IOMMU case:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
            Alloc +
  Alloc     Compose                 Store     Use


That's exactly reflected in hierarchical irq domains:

                       |
                       |
  [CPU]        -- [Bri | dge] --    Bus    -- [Device]
                       |   
  Alloc +           
  Compose                         Store        Use

  Vectordomain                   Busdomain

and:

                     |
                     |
  [CPU]       -- [IO | MMU]  -- [Bridge] --    Bus    -- [Device]
                     |
                  Alloc +   
  Alloc           Compose                    Store       Use

  Vectordomain   Remapdomain                Busdomain
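
To make the Alloc / Compose / Store split concrete, here is a minimal
user space sketch of such a hierarchy. It is a toy model and not the
kernel's irq domain API: struct toy_domain, toy_setup_irq() and all
addresses, vectors and indices in it are made up for illustration. The
only property it tries to show is that the Busdomain stores whatever the
nearest composing parent hands down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy message: the address/data pair the device sends for an interrupt. */
struct toy_msg {
	uint64_t addr;
	uint32_t data;
};

/* Toy domain; deliberately not the kernel's struct irq_domain. */
struct toy_domain {
	const char *name;
	struct toy_domain *parent;
	int handle;	/* resource allocated at this level */
	int (*alloc)(struct toy_domain *d);	/* NULL: nothing to allocate */
	void (*compose)(const struct toy_domain *d, struct toy_msg *msg);
						/* NULL: cannot compose */
	struct toy_msg stored;	/* Busdomain only: what the device uses */
};

/* Alloc: parents first, so a child never builds on a failed parent. */
static int toy_alloc(struct toy_domain *d)
{
	int ret;

	if (!d)
		return 0;
	ret = toy_alloc(d->parent);
	if (ret)
		return ret;
	return d->alloc ? d->alloc(d) : 0;
}

/* Compose: the nearest ancestor which is able to compose wins. */
static int toy_compose(const struct toy_domain *bus, struct toy_msg *msg)
{
	const struct toy_domain *d;

	for (d = bus->parent; d; d = d->parent) {
		if (d->compose) {
			d->compose(d, msg);
			return 0;
		}
	}
	return -1;
}

/* Store: the only step the Busdomain performs itself. */
static int toy_setup_irq(struct toy_domain *bus)
{
	struct toy_msg msg;
	int ret;

	assert(bus != NULL);
	ret = toy_alloc(bus);
	if (ret)
		return ret;
	ret = toy_compose(bus, &msg);
	if (ret)
		return ret;
	bus->stored = msg;
	return 0;
}

static int vector_alloc(struct toy_domain *d)
{
	d->handle = 0x41;	/* made-up vector number */
	return 0;
}

static void vector_compose(const struct toy_domain *d, struct toy_msg *msg)
{
	msg->addr = 0x1000;	/* made-up "direct" target */
	msg->data = (uint32_t)d->handle;
}

static int remap_alloc(struct toy_domain *d)
{
	d->handle = 7;		/* made-up remap table index */
	return 0;
}

static void remap_compose(const struct toy_domain *d, struct toy_msg *msg)
{
	msg->addr = 0x2000;	/* made-up "remapped" target */
	msg->data = (uint32_t)d->handle;
}
```

With a Busdomain whose parent is the Vectordomain, toy_setup_irq()
stores 0x1000/0x41. Putting a Remapdomain between the two makes it store
0x2000/7 while the Vectordomain still allocates its vector, and the
Busdomain code does not change at all.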


Now if we look at the virtualization scenario and device pass-through
then the structure in the guest view is not any different from the basic
case. This works with PCI-MSI[X] and the IDXD IMS variant because the
hypervisor can trap the access to the storage and translate the message:

                   |
                   |
  [CPU]    -- [Bri | dge] -- Bus -- [Device]
                   |
  Alloc +
  Compose                   Store     Use
                             |
                             | Trap
                             v
                             Hypervisor translates and stores

But with an IMS storage location which is software controlled by the
guest side driver (the case Jason is interested in) the above cannot
work, because there is no store access which the hypervisor could trap.
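
The difference between the two store paths can be sketched like this.
It is a toy model under loud assumptions: toy_hv_translate() and its
"add 0x80, retarget to 0x2000" rule are invented, and a real hypervisor
does something entirely different in detail. The point is only who gets
to see the message before the device uses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_msg {
	uint64_t addr;
	uint32_t data;
};

/* The message slot the hardware really uses when raising the interrupt. */
struct toy_device {
	struct toy_msg slot;
};

/* Hypothetical translation of a guest composed message to a host one. */
static struct toy_msg toy_hv_translate(struct toy_msg guest)
{
	struct toy_msg host = { 0x2000, guest.data + 0x80 };

	return host;
}

/* MSI[X] / IDXD style IMS: the store traps, the hypervisor fixes it up. */
static void toy_trapped_store(struct toy_device *dev, struct toy_msg guest)
{
	assert(dev != NULL);
	dev->slot = toy_hv_translate(guest);
}

/* IMS in guest memory: nothing traps, the guest message is used as is. */
static void toy_direct_store(struct toy_device *dev, struct toy_msg guest)
{
	assert(dev != NULL);
	dev->slot = guest;
}
```

In the trapped case the device ends up with the translated message, in
the direct case with the untranslated guest view, which is meaningless
or worse on the host.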

That means the guest needs a way to ask the hypervisor for a proper
translation, i.e. a hypercall. Now where to do that? Looking at the
above remapping case it's pretty obvious:


                     |
                     |
  [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
                     |
  Alloc          "Compose"                   Store         Use

  Vectordomain   HCALLdomain                Busdomain
                 |        ^
                 |        |
                 v        | 
            Hypervisor    
               Alloc + Compose

Why? Because it reflects the boundaries and leaves the busdomain part
agnostic as it should be. And it works for _all_ variants of Busdomains.

Now the question which I can't answer is whether this can work correctly
in terms of isolation. If the IMS storage is in guest memory (queue
storage) then the guest driver can obviously write random crap into it
which the device will happily send. (For MSI and IDXD style IMS the
hypervisor can still trap the store).

Is the IOMMU/Interrupt remapping unit able to catch such messages which
go outside the space to which the guest is allowed to signal? If yes,
problem solved. If no, then IMS storage in guest memory can't ever work.

Coming back to this:

> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.

The subdevice domain is a 'Busdomain' according to the structure
above. It does not and should never have any clue about the underlying
system. It's in the agnostic part and always works. It simply does not
care what's underneath. So it won't return -EOPNOTSUPP.

What it has to do is to transport the "IMS in queue memory" requirement
to the underlying parent domain.

So in case that the HCALL domain is missing, the Vector domain needs to
return an error code on domain creation. If the HCALL domain is there
then the domain creation works and in case of actual interrupt
allocation the hypercall either returns a valid composed message or an
appropriate error code.
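
As a rough sketch of that flow in plain C: everything below is
hypothetical, including the TOY_REQ_IMS_QUEUE_MEM flag, the in_vm field,
the choice of -EOPNOTSUPP and -ENOSPC as error codes and the fake
hypercall with its limit of four interrupts. None of it is an existing
kernel or hypervisor interface. It only shows that the Busdomain
forwards the requirement and that the parent is the one to decide.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_REQ_IMS_QUEUE_MEM	0x1u	/* hypothetical requirement flag */

struct toy_msg {
	uint64_t addr;
	uint32_t data;
};

/* What the Busdomain sees of its parent. */
struct toy_parent {
	int in_vm;	/* guest detected a hypervisor, e.g. via CPUID */
	/* NULL when there is no HCALL domain */
	int (*hcall_compose)(unsigned int irq, struct toy_msg *msg);
};

/* The parent decides whether it can satisfy the requirement. */
static int toy_parent_check(const struct toy_parent *p, unsigned int req)
{
	assert(p != NULL);
	if ((req & TOY_REQ_IMS_QUEUE_MEM) && p->in_vm && !p->hcall_compose)
		return -EOPNOTSUPP;
	return 0;
}

/* The Busdomain has no opinion; it just transports the requirement. */
static int toy_bus_domain_create(const struct toy_parent *p, unsigned int req)
{
	return toy_parent_check(p, req);
}

/* Interrupt allocation: hypercall in a VM, local compose otherwise. */
static int toy_irq_alloc(const struct toy_parent *p, unsigned int irq,
			 struct toy_msg *msg)
{
	assert(p != NULL && msg != NULL);
	if (p->in_vm && p->hcall_compose)
		return p->hcall_compose(irq, msg);
	msg->addr = 0x1000;	/* made-up bare metal target */
	msg->data = irq;
	return 0;
}

/* Fake hypervisor side: Alloc + Compose, or an error. */
static int toy_hypervisor_compose(unsigned int irq, struct toy_msg *msg)
{
	if (irq >= 4)		/* made-up per guest limit */
		return -ENOSPC;
	msg->addr = 0x2000;	/* made-up host side target */
	msg->data = 0x80 + irq;
	return 0;
}
```

With in_vm set and no hcall_compose, toy_bus_domain_create() fails for
the queue memory requirement but still succeeds for trappable storage.
With in_vm clear the sketch composes locally for any storage type, which
is the bare metal path.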

But there's a catch:

This only works when the guest OS actually knows that it runs in a
VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
solved because from the guest OS view that's the same as running on bare
metal. Obviously on bare metal the Vector domain can and must handle
this.

So this needs some thought.

Thanks,

        tglx
