On Thu, Apr 28, 2022 at 12:10:37PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:
>
> > 2) Costly GUPs.  pseries (the most common ppc machine type) always
> > expects a (v)IOMMU.  That means that, unlike the common x86 model of
> > a host with an IOMMU but guests with no vIOMMU, guest initiated
> > maps/unmaps can be a hot path.  Accounting in that path can be
> > prohibitive (and on POWER8 in particular it prevented us from
> > optimizing that path the way we wanted).  We had two solutions for
> > that: in v1, the explicit ENABLE/DISABLE calls, which preaccounted
> > based on the IOVA window sizes.  That was improved in v2, which
> > used the concept of preregistration.  IIUC iommufd can achieve the
> > same effect as preregistration using IOAS_COPY, so this one isn't
> > really a problem either.
>
> I think PPC and S390 are solving the same problem here.  I think S390
> is going to go to a SW nested model where it has an iommu_domain
> controlled by iommufd that is populated with the pinned pages, eg
> stored in an xarray.
>
> Then the performance map/unmap path is simply copying pages from the
> xarray to the real IOPTEs - and this would be modeled as a nested
> iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
>
> Perhaps this is agreeable for PPC too?

Uh.. maybe?  Note that I'm making these comments based on working on
this some years ago (the initial VFIO for ppc implementation in
particular).  I'm no longer actively involved in ppc kernel work.

> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > 2 IOVA windows, which aren't contiguous with each other.  The base
> > addresses of each of these are fixed, but the size of each window,
> > the pagesize (i.e. granularity) of each window and the number of
> > levels in the IOMMU pagetable are runtime configurable.  Because
> > it's true in the hardware, it's also true of the vIOMMU interface
> > defined by the IBM hypervisor (and adopted by KVM as well).  So,
> > guests can request changes in how these windows are handled.
> > Typical Linux guests will use the "low" window (IOVA 0..2GiB)
> > dynamically, and the high window (IOVA 1<<60..???) to map all of
> > RAM.  However, as a hypervisor we can't count on that; the guest
> > can use them however it wants.
>
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
>
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.

Hmm.. not sure if that works.  At the moment, qemu (for example) needs
to set up the domains/containers/IOASes as it constructs the machine,
because that's based on the virtual hardware topology.  Initially they
use the default windows (0..2GiB first window, second window
disabled).  Only once the guest kernel is up and running does it issue
the hypercalls to set the final windows as it prefers.  In theory the
guest could change them again at runtime, though that's unlikely in
practice.  They could change during the machine's lifetime in
practice, though, if you rebooted from one guest kernel to another
that uses a different configuration.

*Maybe* IOAS construction can be deferred somehow, though I'm not sure
because the assigned devices need to live somewhere.
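For what it's worth, if we did go down the driver specific creation
data path, I'd imagine it describing each window along the lines of
the sketch below.  To be clear, this is just to illustrate the shape
of it - the struct and field names are made up, not an existing or
proposed uAPI.

#include <linux/types.h>

/*
 * Hypothetical sketch only, not a real uAPI: roughly the per-window
 * parameters the guest hands us via the DDW calls.
 */
struct pseries_iommu_window_desc {
	__aligned_u64 start_iova;	/* fixed base: 0 for the low window,
					 * around 1<<60 for the high one */
	__aligned_u64 size;		/* window size requested by the guest */
	__u32 page_shift;		/* IOMMU page size for this window */
	__u32 levels;			/* levels in the IOMMU pagetable */
};

struct pseries_iommu_domain_data {
	__u32 flags;			/* e.g. which windows are enabled */
	__u32 __reserved;
	struct pseries_iommu_window_desc windows[2];
};

Even with something like that, the timing problem above remains: at
machine construction time all qemu could put in there is the default
configuration, so the domain would need to be recreated (or somehow
reconfigured) when the guest later changes the windows.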
> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window.  This kAPI
> will have to be improved to model the PPC multi-window iommu_domain.

Right.

> If this API is not used then the PPC driver should choose some
> sensible default windows that make things like DPDK happy.
>
> > Then, there's handling existing qemu (or other software) that is
> > using the VFIO SPAPR_TCE interfaces.  First, it's not entirely
> > clear if this should be a goal or not: as others have noted,
> > working actively to port qemu to the new interface at the same
> > time as making a comprehensive in-kernel compat layer is arguably
> > redundant work.
>
> At the moment I think I would stick with not including the SPAPR
> interfaces in vfio_compat, but there does seem to be a path if
> someone with HW wants to build and test them?
>
> > You might be able to do this by simply failing this outright if
> > there's anything other than exactly one IOMMU group bound to the
> > container / IOAS (which I think might be what VFIO itself does
> > now).  Handling that with a device centric API gets somewhat
> > fiddlier, of course.
>
> Maybe every device gets a copy of the error notification?

Alas, it's harder than that.  One of the things that can happen on an
EEH fault is that the entire PE gets suspended (blocking both DMA and
MMIO, IIRC) until the proper recovery steps are taken.  Since that's
handled at the hardware/firmware level, it will obviously only affect
the host side PE (== host iommu group).  However, the interfaces we
have only allow things to be reported to the guest at the granularity
of a guest side PE (== container/IOAS == guest host bridge in
practice).  So, to handle this correctly when guest PE != host PE,
we'd need to synchronize suspended / recovery state between all the
host PEs in the guest PE.  That *might* be technically possible, but
it's really damn fiddly.

> ie maybe this should be part of vfio_pci and not part of iommufd to
> mirror how AER works?
>
> It feels strange to put device error notification into iommufd; is
> that connected to the IOMMU?

Only in that it operates at the granularity of a PE, which is mostly
an IOMMU concept.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson