Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment

From: Alex Williamson <alex.williamson@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Aviv B.D." <bd.aviv@gmail.com>,
	Jan Kiszka <jan.kiszka@siemens.com>,
	qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
Date: Mon, 6 Jun 2016 12:21:34 -0600	[thread overview]
Message-ID: <20160606122134.47e6ef39@ul30vt.home> (raw)
In-Reply-To: <20160606080911.GI21254@pxdev.xzpeter.org>

On Mon, 6 Jun 2016 16:09:11 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jun 02, 2016 at 07:34:17AM -0600, Alex Williamson wrote:
> > On Thu, 02 Jun 2016 13:09:27 +0000
> > "Aviv B.D." <bd.aviv@gmail.com> wrote:
> >   
> > > Hi,
> > > 
> > > In case of hot plug vfio device there should not be any active mapping
> > > to this device prior the device addition.  
> > 
> > Counter example - a device is hot added to a guest booted with iommu=pt.  
> 
> I got the same question with Aviv...
> 
> For hot-plug devices

First, let's just remove "hot-plug" from this discussion because I'm
afraid someone is going to say "let's just not support hotplug
devices".  The issue of moving a device to a pre-populated domain can
occur any time a device attaches to a new domain.  It might occur if a
device is added to a vfio driver in the L1 guest where it's already
managing a device.  It might occur any time the device is released from
an L1 vfio user and returned to the static identity map in the L1 guest
kernel.

>, even if it is using iommu=pt, shouldn't it still
> follow the steps that first init vfio device, then configure device
> context entry? Let me list the steps for device addition in case I got
> any mistake:
> 
> 1. user add new VFIO device A
> 
> 2. vfio_listener_region_add() called for device A on the IOMMU mr,
>    here we should create the iommu notifier. However since the context
>    entry still does not exist, memory_region_iommu_replay() will got
>    all invalid IOTLB (IOMMU_NONE entries)
> 
> 3. guest kernel found the device, enabled the device, filled in
>    context entry for device A with "pass-through" (so the SLPTPTR is
>    invalid)
> 
> 4. guest sent context invalidation to QEMU vIOMMU since we have CM=1
>    set for guest vIOMMU
> 
> 5. QEMU vIOMMU handle the invalidation, trigger VFIO notify to do
>    correct VFIO mapping for device A
> 
> Though here step 5 should still be missing (IIUC Aviv's patch 3 still
> not handled context invalidation). Just want to know whether we can
> avoid the replay operation for Intel vIOMMUs (for Intel only, because
> Intel has context invalidation and cache mode support, not sure about
> other platform)?

I'm not sure why you're so eager to avoid implementing a replay
callback for VT-d.  What happens when VT-d is not enabled by the
guest?  vfio/pci.c calls pci_device_iommu_address_space() to get the
address space for the device, which results in vtd_find_add_as() which
gives us a unique VTDAddressSpace.as for that device.  With VT-d not in
use by the guest, when do steps 3-5 occur?  I agree with your reasoning
when VT-d is enabled, but the BIOS/UEFI does not enable VT-d, so are we
going to exclude all use cases of an assigned device prior to the guest
enabling VT-d?

On that same topic, I'm really dubious that we have the flexibility in
our memory API to really support an IOMMU like VT-d and the layout of
having a VTDAddressSpace per device, rather than exposing shared
address spaces, has implications on the efficiency and locked memory
requirements for vfio.  In the above case with VT-d disabled, the
VTDAddressSpace should alias to the system memory AddressSpace and
dynamically switch to a per device address space when VT-d is enabled.
AFAICT, we don't have anything resembling that sort of feature, so we
rely on the IOMMU driver to replay, perhaps even configuring its own
MemoryListener to IOMMU notifier gateway, which is also something that
doesn't currently exist.

Additionally, if there are things unique to VT-d, for instance if VT-d
is enabled and we can rely on the sequence of events you've set forth,
then yes, the replay mechanism should do nothing.  But only the VT-d
code can decide that, which implies that vfio always needs to call the
replay function and a new MemoryRegionIOMMUOps callback needs to decide
what to do given the current state of the vIOMMU.  Thanks,

Alex