On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:
> On Tue, 20 Dec 2016 11:44:41 +0800
> Peter Xu wrote:
> 
> > On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote:
> > > On Mon, 19 Dec 2016 22:41:26 +0800
> > > Peter Xu wrote:
> > > 
> > > > This is preparation work to finally enable dynamic switching ON/OFF for
> > > > VT-d protection. The old VT-d code is using a static IOMMU region, and
> > > > that won't satisfy vfio-pci device listeners.
> > > > 
> > > > Let me explain.
> > > > 
> > > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > > mechanism to make sure the device mapping is coherent with the guest
> > > > even if there are domain switches. And there are two kinds of domain
> > > > switches:
> > > > 
> > > > (1) switch from domain A -> B
> > > > (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > > 
> > > > Case (1) is handled by the context entry invalidation handling in the
> > > > VT-d replay logic. What the replay function should do here is to replay
> > > > the existing page mappings in domain B.
> > > > 
> > > > However for case (2), we don't want to replay any domain mappings - we
> > > > just need the default GPA->HPA mappings (the address_space_memory
> > > > mapping). And this patch helps with case (2) by building up the mapping
> > > > automatically by leveraging the vfio-pci memory listeners.
> > > > 
> > > > Another important thing that this patch does is to separate
> > > > IR (Interrupt Remapping) from DMAR (DMA Remapping). The IR region should
> > > > not depend on the DMAR region (as it did before this patch). It should be
> > > > a standalone region, and it should be able to be activated without
> > > > DMAR (which is a common behavior of the Linux kernel - by default it
> > > > enables IR while disabling DMAR).
> > > 
> > > This seems like an improvement, but I will note that there are existing
> > > locked memory accounting issues inherent with VT-d and vfio. With
> > > VT-d, each device has a unique AddressSpace. This requires that each
> > > is managed via a separate vfio container. Each container is accounted
> > > for separately for locked pages. libvirt currently only knows that if
> > > any vfio devices are attached, the locked memory limit for the
> > > process needs to be set sufficient for the VM memory. When VT-d is
> > > involved, we either need to figure out how to associate otherwise
> > > independent vfio containers to share locked page accounting, or teach
> > > libvirt that the locked memory requirement needs to be multiplied by
> > > the number of attached vfio devices. The latter seems far less
> > > complicated, but it reduces the containment of QEMU a bit since the
> > > process has the ability to lock potentially many multiples of the VM
> > > address size. Thanks,
> > 
> > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > it once and for all. I think we can do better than this in the future,
> > for example, one address space per guest IOMMU domain (as you have
> > mentioned before). However I suppose that will need more work (and I
> > still can't estimate the amount of work). So I am considering enabling
> > the device assignments functionally first, then we can further
> > improve based on a workable version. The same thoughts apply to the
> > IOMMU replay RFC series.
> 
> I'm not arguing against it, I'm just trying to set expectations for
> where this gets us. An AddressSpace per guest IOMMU domain seems like
> the right model for QEMU, but it has some fundamental issues with
> vfio. We currently tie a QEMU AddressSpace to a vfio container, which
> represents the host IOMMU context. The AddressSpace of a device is
> currently assumed to be fixed in QEMU,

Actually, I think we can work around this: you could set up a separate
AddressSpace for each device which consists of nothing but a big alias
into an AddressSpace associated with the current IOMMU domain. As the
device is moved between domains you remove/replace the alias region -
or even replace it with an alias direct into system memory when the
IOMMU is disabled.
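Purely to illustrate that alias idea - a rough sketch against QEMU's
existing MemoryRegion API only; the DeviceDomainAS structure and
device_as_switch_root() below are made-up names for this sketch, not
anything in the current intel_iommu code:

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Hypothetical per-device state for this sketch; the real intel_iommu
 * structures are laid out differently.  das->root is a pure container
 * MemoryRegion; the device's AddressSpace is initialized once with
 * address_space_init(&das->as, &das->root, name). */
typedef struct DeviceDomainAS {
    Object *owner;          /* e.g. the IOMMU device */
    MemoryRegion root;      /* container MR, root of the device's AS */
    MemoryRegion *alias;    /* single big alias into the current root */
    AddressSpace as;
} DeviceDomainAS;

/* Re-point the device's AddressSpace at a new root: either the
 * MemoryRegion of the guest IOMMU domain it was just attached to, or
 * get_system_memory() when DMAR is disabled (plain GPA->HPA). */
static void device_as_switch_root(DeviceDomainAS *das, MemoryRegion *new_root)
{
    MemoryRegion *alias = g_new0(MemoryRegion, 1);

    memory_region_transaction_begin();
    if (das->alias) {
        /* Drop the old alias (object lifetime details elided here). */
        memory_region_del_subregion(&das->root, das->alias);
        object_unparent(OBJECT(das->alias));
    }
    memory_region_init_alias(alias, das->owner, "domain-alias",
                             new_root, 0, memory_region_size(new_root));
    memory_region_add_subregion(&das->root, 0, alias);
    das->alias = alias;
    memory_region_transaction_commit();
}

Wrapping the swap in a memory region transaction means the vfio memory
listener sees one region_del/region_add pair per switch, which is what
would drive the corresponding unmap/map on the host side.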
> guest IOMMU domains clearly
> are not. vfio only lets us have access to a device while it's
> protected within a container. Therefore, in order to move a device to a
> different AddressSpace based on the guest domain configuration, we'd
> need to tear down the vfio configuration, including releasing the
> device.
> 
> > Regarding the locked memory accounting issue: do we have an existing
> > way to do the accounting? If so, would you (or anyone) please
> > elaborate a bit? If not, is that ongoing/planned work?
> 
> As I describe above, there's a vfio container per AddressSpace; each
> container is an IOMMU domain in the host. In the guest, an IOMMU
> domain can include multiple AddressSpaces, one for each context entry
> that's part of the domain. When the guest programs a translation for
> an IOMMU domain, that maps a guest IOVA to a guest physical address,
> for each AddressSpace. Each AddressSpace is backed by a vfio
> container, which needs to pin the pages of that translation in order to
> get a host physical address, which then gets programmed into the host
> IOMMU domain with the guest IOVA and host physical address. The
> pinning process is where page accounting is done.

Ah.. and I take it the accounting isn't smart enough to tell that the
same page is already pinned elsewhere. I guess that would take rather
a lot of extra bookkeeping.
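To make that concrete, here is roughly what each container does from
userspace with the type1 interface (a minimal sketch; container_fd is
assumed to be an already-configured VFIO_TYPE1_IOMMU container, one per
device AddressSpace in the VT-d case):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map (and thereby pin) a chunk of guest RAM into one container.  Every
 * container that maps the same guest RAM pins those pages again, and the
 * kernel charges them again against the process's locked-memory limit. */
static int map_guest_ram(int container_fd, void *guest_ram,
                         uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)guest_ram,
        .iova  = iova,
        .size  = size,
    };

    /* Pins [vaddr, vaddr + size) and accounts it for this container only. */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

So with, say, 8 GiB of guest RAM and three assigned devices behind a
disabled (or passthrough) guest VT-d, each of the three containers pins
all of guest RAM and roughly 24 GiB ends up charged against the
locked-memory limit, while libvirt sizes the limit for a single copy.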
> It's done per vfio
> context. The worst case scenario for accounting is thus when VT-d is
> present but disabled (or in passthrough mode), as each AddressSpace
> duplicates address_space_memory and every page of guest memory is
> pinned and accounted for each vfio container.

Hmm. I imagine you'll need a copy of the current translation tables
for a guest domain regardless of VFIO involvement. So, when a domain
is unused - i.e. has no devices in it - won't the container have all
the groups detached and so give up all the memory? Obviously when a
device is assigned to the domain you'll need to replay the current
mappings into VFIO.

> That's the existing way we do accounting. There is no current
> development that I'm aware of to change this. As above, the simplest
> stop-gap solution is that libvirt would need to be aware when VT-d is
> present for a VM and use a different algorithm to set the QEMU locked
> memory limit, but it's not without its downsides. Alternatively, a new
> IOMMU model would need to be developed for vfio. The type1 model was
> only ever intended to be used for relatively static user mappings, and I
> expect it to have horrendous performance when backing a dynamic guest
> IOMMU domain. Really the only guest IOMMU usage model that makes any
> sort of sense with type1 is to run the guest with passthrough (iommu=pt)
> and only pull devices out of passthrough for relatively static mapping
> cases within the guest userspace (nested assigned devices or dpdk). If
> the expectation is that we just need this one little bit more code to
> make vfio usable in the guest, that may be true, but it really is just
> barely usable. It's not going to be fast for any sort of dynamic
> mapping, and it's going to have accounting issues that are not
> compatible with how libvirt sets locked memory limits for QEMU as soon
> as you go beyond a single device. Thanks,

Maybe we should revisit the idea of a "type2" IOMMU which could handle
both guest VT-d and guest PAPR TCEs. I'm not excessively fond of the
pre-registration model that PAPR uses at the moment, but it might be
the best available way to deal with the accounting issue.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson