linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Tian, Kevin" <kevin.tian@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>, Thomas Gleixner <tglx@linutronix.de>
Cc: "Williams, Dan J" <dan.j.williams@intel.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"Jiang, Dave" <dave.jiang@intel.com>,
	Bjorn Helgaas <helgaas@kernel.org>,
	"vkoul@kernel.org" <vkoul@kernel.org>,
	"Dey, Megha" <megha.dey@intel.com>,
	"maz@kernel.org" <maz@kernel.org>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"Pan, Jacob jun" <jacob.jun.pan@intel.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"Lu, Baolu" <baolu.lu@intel.com>,
	"Kumar, Sanjay K" <sanjay.k.kumar@intel.com>,
	"Luck, Tony" <tony.luck@intel.com>,
	"kwankhede@nvidia.com" <kwankhede@nvidia.com>,
	"eric.auger@redhat.com" <eric.auger@redhat.com>,
	"parav@mellanox.com" <parav@mellanox.com>,
	"rafael@kernel.org" <rafael@kernel.org>,
	"netanelg@mellanox.com" <netanelg@mellanox.com>,
	"shahafs@mellanox.com" <shahafs@mellanox.com>,
	"yan.y.zhao@linux.intel.com" <yan.y.zhao@linux.intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"Ortiz, Samuel" <samuel.ortiz@intel.com>,
	"Hossain, Mona" <mona.hossain@intel.com>,
	"dmaengine@vger.kernel.org" <dmaengine@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
Date: Mon, 9 Nov 2020 07:37:03 +0000	[thread overview]
Message-ID: <MWHPR11MB164578A1CC38EB28F6EA8F918CEA0@MWHPR11MB1645.namprd11.prod.outlook.com> (raw)
In-Reply-To: <20201108232341.GB2620339@nvidia.com>

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, November 9, 2020 7:24 AM
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                      |
> >                      |
> >   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
> >                      |
> >   Alloc          "Compose"                   Store         Use
> >
> >   Vectordomain   HCALLdomain                Busdomain
> >                  |        ^
> >                  |        |
> >                  v        |
> >             Hypervisor
> >                Alloc + Compose
> 
> Yes, this will describes what I have been thinking

Agree

> 
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> There are four cases of interest here:
> 
>  1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
>     to the APIC. IMS works perfectly with
> pci_subdevice_msi_create_irq_domain()
> 
>  2) SRIOV VF assigned to the guest.
> 
>     The guest can cause any MemWr TLP to any addr/data pair
>     and the iommu/platform/vmm is supposed to use the
>     Bus/device/function to isolate & secure the interrupt address
>     range.
> 
>     IMS can work in the guest if the guest knows the details of the
>     address range and can make hypercalls to setup routing. So
>     pci_subdevice_msi_create_irq_domain() works if the hypercalls
>     exist and fails if they don't.
> 
>  3) SIOV sub device assigned to the guest.
> 
>     The difference between SIOV and SRIOV is the device must attach a
>     PASID to every TLP triggered by the guest. Logically we'd expect
>     when IMS is used in this situation the interrupt MemWr is tagged
>     with bus/device/function/PASID to uniquly ID the guest and the same
>     security protection scheme from #2 applies.

Unfortunately no. Intel VT-d only treats MemWr w/o PASID to 0xFEExxxxx
as interrupt request. MemWr w/ PASID, even to 0xFEE, is translated
normally through DMA remapping page table. I don't know other IOMMU
vendors. But at least on Intel platform such device would not get the 
desired effect, since the IOMMU only guarantees interrupt isolation in 
BDF-level.

Does your device already implement such capability? We can bring this 
request back to the hardware team. 

> 
>  4) SIOV sub device assigned to the guest, but with emulation.
> 
>     This SIOV device cannot tag interrupts with PASID so cannot do #2
>     (or the platform cannot recieve a PASID tagged interrupt message).
> 
>     Since the interrupts are being delivered with TLPs pointing at the
>     hypervisor the only solution is for the hypervisor to exclusively
>     control the interrupt table. MSI table like emulation for IMS is
>     needed and the hypervisor will use
> pci_subdevice_msi_create_irq_domain()
>     to get the real interrupts.
> 
>     pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
>     addr/data pairs which are actually an ABI between the guest and
>     hypervisor carried in the hidden hypercall of the emulation.
>     (ie it works like MSI works today)
> 
> IDXD is worring about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
> 
> > Is the IOMMU/Interrupt remapping unit able to catch such messages which
> > go outside the space to which the guest is allowed to signal to? If yes,
> > problem solved. If no, then IMS storage in guest memory can't ever work.
> 
> Right. Only PASID on the interrupt messages can resolve this securely.
> 
> > So in case that the HCALL domain is missing, the Vector domain needs
> > return an error code on domain creation. If the HCALL domain is there
> > then the domain creation works and in case of actual interrupt
> > allocation the hypercall either returns a valid composed message or an
> > appropriate error code.
> 
> Yes
> 
> > But there's a catch:
> >
> > This only works when the guest OS actually knows that it runs in a
> > VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> > solved because from the guest OS view that's the same as running on bare
> > metal. Obviously on bare metal the Vector domain can and must handle
> > this.
> 
> Yes
> 
> The flip side is today, the way pci_subdevice_msi_create_irq_domain()
> works a VF using it on baremetal will succeed and if that same VF is
> assigned to a guest then pci_subdevice_msi_create_irq_domain()
> succeeds but the interrupt never comes - so the driver is broken.

Yes, this is the main worry here. While all agree that using hypercall is 
the proper way to virtualize IMS, how to disable it when hypercall is
not available is a more urgent demand at current stage.

> 
> Yes, if we add some ACPI/etc flag that is not going to magically fix
> old kvm's, but at least *something* exists that works right and
> generically.

Agree. We can work together on this definition. 

btw in reality such ACPI extension doesn't exist yet, which likely will
take some time. In the meantime we already have pending usages 
like IDXD. Do you suggest holding these patches until we get ASWG 
to accept the extension, or accept using Intel IMS cap as a vendor
specific mitigation to move forward while the platform flag is being 
worked on? Anyway the IMS cap is already defined and can help fix 
some broken cases.

> 
> If we follow Intel's path then we need special KVM level support for
> *every* device, PCI cap mangling and so on. Forever. Sounds horrible
> to me..
> 
> This feels like one of these things where no matter what we do
> something is broken. Picking the least breakage is the challenge here.
> 
> > So this needs some thought.
> 
> I think your HAOLL diagram is the only sane architecture.
> 
> If go that way then case #4 will still work, in this case the HCALL
> will return addr/data pairs that conform to what the emulation
> expects. Or fail if the VMM can't do emulation for the device.
> 

Thanks
Kevin

  parent reply	other threads:[~2020-11-09  7:37 UTC|newest]

Thread overview: 125+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
2020-10-30 18:50 ` [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver Dave Jiang
2020-10-30 22:01   ` Thomas Gleixner
2020-10-30 18:51 ` [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support Dave Jiang
2020-10-30 20:31   ` Thomas Gleixner
2020-10-30 20:52     ` Dave Jiang
2020-10-30 18:51 ` [PATCH v4 03/17] dmaengine: idxd: add theory of operation documentation for idxd mdev Dave Jiang
2020-10-30 18:51 ` [PATCH v4 04/17] dmaengine: idxd: add support for readonly config devices Dave Jiang
2020-10-30 18:51 ` [PATCH v4 05/17] dmaengine: idxd: add interrupt handle request support Dave Jiang
2020-10-30 18:51 ` [PATCH v4 06/17] PCI: add SIOV and IMS capability detection Dave Jiang
2020-10-30 19:51   ` Bjorn Helgaas
2020-10-30 21:20     ` Dave Jiang
2020-10-30 21:50       ` Bjorn Helgaas
2020-10-30 22:45       ` Jason Gunthorpe
2020-10-30 22:49         ` Dave Jiang
2020-11-02 13:21           ` Jason Gunthorpe
2020-11-03  2:49             ` Tian, Kevin
2020-11-03 12:43               ` Jason Gunthorpe
2020-11-04  3:41                 ` Tian, Kevin
2020-11-04 12:40                   ` Jason Gunthorpe
2020-11-04 13:34                     ` Tian, Kevin
2020-11-04 13:54                       ` Jason Gunthorpe
2020-11-06  9:48                         ` Tian, Kevin
2020-11-06 13:14                           ` Jason Gunthorpe
2020-11-06 16:48                             ` Raj, Ashok
2020-11-06 17:51                               ` Jason Gunthorpe
2020-11-06 23:47                                 ` Dan Williams
2020-11-07  0:12                                   ` Jason Gunthorpe
2020-11-07  1:42                                     ` Dan Williams
2020-11-08 18:11                                     ` Raj, Ashok
2020-11-08 18:34                                       ` David Woodhouse
2020-11-08 23:25                                         ` Raj, Ashok
2020-11-10 14:19                                           ` Raj, Ashok
2020-11-10 14:41                                             ` David Woodhouse
2020-11-08 23:41                                       ` Jason Gunthorpe
2020-11-09  0:05                                         ` Raj, Ashok
2020-11-08 18:47                                     ` Thomas Gleixner
2020-11-08 19:36                                       ` David Woodhouse
2020-11-08 22:47                                         ` Thomas Gleixner
2020-11-08 23:29                                           ` Jason Gunthorpe
2020-11-11 15:41                                         ` Christoph Hellwig
2020-11-11 16:09                                           ` Raj, Ashok
2020-11-11 22:27                                             ` Thomas Gleixner
2020-11-11 23:03                                               ` Raj, Ashok
2020-11-12  1:13                                                 ` Thomas Gleixner
2020-11-12 13:10                                                 ` Jason Gunthorpe
2020-11-08 23:23                                       ` Jason Gunthorpe
2020-11-08 23:36                                         ` Raj, Ashok
2020-11-09  7:37                                         ` Tian, Kevin [this message]
2020-11-09 16:46                                           ` Jason Gunthorpe
2020-11-08 23:58                                       ` Raj, Ashok
2020-11-09  7:59                                         ` Tian, Kevin
2020-11-09 11:21                                         ` Thomas Gleixner
2020-11-09 17:30                                           ` Jason Gunthorpe
2020-11-09 22:40                                             ` Raj, Ashok
2020-11-09 22:42                                             ` Thomas Gleixner
2020-11-10  5:14                                               ` Raj, Ashok
2020-11-10 10:27                                                 ` Thomas Gleixner
2020-11-10 14:13                                                   ` Raj, Ashok
2020-11-10 14:23                                                     ` Jason Gunthorpe
2020-11-11  2:17                                                       ` Tian, Kevin
2020-11-12 13:46                                                         ` Jason Gunthorpe
2020-11-11  7:14                                                     ` Tian, Kevin
2020-11-12 19:32                                                       ` Konrad Rzeszutek Wilk
2020-11-12 22:42                                                         ` Thomas Gleixner
2020-11-13  2:42                                                           ` Tian, Kevin
2020-11-13 12:57                                                             ` Jason Gunthorpe
2020-11-13 13:32                                                             ` Thomas Gleixner
2020-11-13 16:12                                                               ` Luck, Tony
2020-11-13 17:38                                                                 ` Raj, Ashok
2020-11-14 10:34                                                           ` Christoph Hellwig
2020-11-14 21:18                                                             ` Raj, Ashok
2020-11-15 11:26                                                               ` Thomas Gleixner
2020-11-15 19:31                                                                 ` Raj, Ashok
2020-11-15 22:11                                                                   ` Thomas Gleixner
2020-11-16  0:22                                                                     ` Raj, Ashok
2020-11-16  7:31                                                                       ` Tian, Kevin
2020-11-16 15:46                                                                         ` Jason Gunthorpe
2020-11-16 17:56                                                                           ` Thomas Gleixner
2020-11-16 18:02                                                                             ` Jason Gunthorpe
2020-11-16 20:37                                                                               ` Thomas Gleixner
2020-11-16 23:51                                                                               ` Tian, Kevin
2020-11-17  9:21                                                                                 ` Thomas Gleixner
2020-11-16  8:25                                                               ` Christoph Hellwig
2020-11-10 14:19                                                 ` Jason Gunthorpe
2020-11-11  2:35                                                   ` Tian, Kevin
2020-11-08 21:18                             ` Thomas Gleixner
2020-11-08 22:09                               ` David Woodhouse
2020-11-08 22:52                                 ` Thomas Gleixner
2020-11-07  0:32                           ` Thomas Gleixner
2020-11-09  5:25                             ` Tian, Kevin
2020-10-30 18:51 ` [PATCH v4 07/17] dmaengine: idxd: add IMS support in base driver Dave Jiang
2020-10-30 18:51 ` [PATCH v4 08/17] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
2020-10-30 18:51 ` [PATCH v4 09/17] dmaengine: idxd: add basic mdev registration and helper functions Dave Jiang
2020-10-30 18:51 ` [PATCH v4 10/17] dmaengine: idxd: add emulation rw routines Dave Jiang
2020-10-30 18:52 ` [PATCH v4 11/17] dmaengine: idxd: prep for virtual device commands Dave Jiang
2020-10-30 18:52 ` [PATCH v4 12/17] dmaengine: idxd: virtual device commands emulation Dave Jiang
2020-10-30 18:52 ` [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm Dave Jiang
2020-10-30 21:26   ` Thomas Gleixner
2020-10-30 18:52 ` [PATCH v4 14/17] dmaengine: idxd: add mdev type as a new wq type Dave Jiang
2020-10-30 18:52 ` [PATCH v4 15/17] dmaengine: idxd: add dedicated wq mdev type Dave Jiang
2020-10-30 18:52 ` [PATCH v4 16/17] dmaengine: idxd: add new wq state for mdev Dave Jiang
2020-10-30 18:52 ` [PATCH v4 17/17] dmaengine: idxd: add error notification from host driver to mediated device Dave Jiang
2020-10-30 18:58 ` [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Jason Gunthorpe
2020-10-30 19:13   ` Dave Jiang
2020-10-30 19:17     ` Jason Gunthorpe
2020-10-30 19:23       ` Raj, Ashok
2020-10-30 19:30         ` Jason Gunthorpe
2020-10-30 20:43           ` Raj, Ashok
2020-10-30 22:54             ` Jason Gunthorpe
2020-10-31  2:50             ` Thomas Gleixner
2020-10-31 23:53               ` Raj, Ashok
2020-11-02 13:20                 ` Jason Gunthorpe
2020-11-02 16:20                   ` Raj, Ashok
2020-11-02 17:19                     ` Jason Gunthorpe
2020-11-02 18:18                       ` Dave Jiang
2020-11-02 18:26                         ` Jason Gunthorpe
2020-11-02 18:38                           ` Dan Williams
2020-11-02 18:51                             ` Jason Gunthorpe
2020-11-02 19:26                               ` Dan Williams
2020-10-30 20:48 ` Thomas Gleixner
2020-10-30 20:59   ` Dave Jiang
2020-10-30 22:10     ` Thomas Gleixner
     [not found] <draft-875z6ekcj5.fsf@nanos.tec.linutronix.de>
2020-11-09 14:08 ` [PATCH v4 06/17] PCI: add SIOV and IMS capability detection Thomas Gleixner
2020-11-09 18:10   ` Raj, Ashok

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MWHPR11MB164578A1CC38EB28F6EA8F918CEA0@MWHPR11MB1645.namprd11.prod.outlook.com \
    --to=kevin.tian@intel.com \
    --cc=alex.williamson@redhat.com \
    --cc=ashok.raj@intel.com \
    --cc=baolu.lu@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dmaengine@vger.kernel.org \
    --cc=eric.auger@redhat.com \
    --cc=helgaas@kernel.org \
    --cc=jacob.jun.pan@intel.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=megha.dey@intel.com \
    --cc=mona.hossain@intel.com \
    --cc=netanelg@mellanox.com \
    --cc=parav@mellanox.com \
    --cc=pbonzini@redhat.com \
    --cc=rafael@kernel.org \
    --cc=samuel.ortiz@intel.com \
    --cc=sanjay.k.kumar@intel.com \
    --cc=shahafs@mellanox.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=vkoul@kernel.org \
    --cc=yan.y.zhao@linux.intel.com \
    --cc=yi.l.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).