On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, May 11, 2022 3:00 AM
> >
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplifies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> >
> > We already did both of these; the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel
> > will select the IOVA internally.
> >
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> >
> > Yes
> >
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other
> > >   (non-FIXED) MAPs won't use it, but doesn't actually put anything
> > >   into the IO pagetables.
> > >   - Like a regular mapping, ATTACHes that are incompatible with an
> > >     IOMAP_RESERVEed region will fail
> > >   - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >     mapping
> >
> > Yeah, this seems OK.  I'm thinking a new API might make sense, because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> >
> > IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> >
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
>
> As a permanent record, do we want to enforce that once the required
> range list is set, all FIXED and non-FIXED allocations must be within
> the list of ranges?

No, I don't think so.  In fact, the way I was envisaging this,
non-FIXED mappings will *never* go into the reserved ranges.  This is
for the benefit of any use cases that need both mappings where they
don't care about the IOVA and mappings where they do.

Essentially, reserving a region here is saying to the kernel "I want
to manage this IOVA space; make sure nothing else touches it".  That
means both that the kernel must disallow any hardware-associated
changes (like ATTACH) which would impinge on the reserved region, and
also any IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is
an interesting question.  I wasn't thinking that would be necessary
(just as you can use mmap() MAP_FIXED anywhere).  However, much as
MAP_FIXED is very dangerous to use if you don't previously reserve
address space, I think IOMAP_FIXED is dangerous if you haven't
previously reserved space.  So maybe it would make sense to only allow
FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel-managed and user-managed
regions does make a certain amount of sense.

> If yes, we can take the end of the last range as the max size of the
> iova address space to optimize the page table layout.
>
> Otherwise we may need another dedicated hint for that optimization.

Right.  With the revised model, where reserving windows is optional
rather than required, I don't think we can quite re-use this for
optimization hints.
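As an aside, the proposed semantics can be sketched as a toy userspace
model.  Everything here is invented for illustration (the
iomap_reserve()/iomap_map() names, the single flat aperture, the
lowest-address allocation policy); it is not the iommufd API.  It just
encodes the rule under discussion: non-FIXED allocations never land in
a reserved range, while a FIXED mapping may overmap one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RANGES 16

struct range { uint64_t start, last; };

/* Reserved ranges; the toy allocator assumes they are added in
 * ascending, non-overlapping order. */
static struct range reserved[MAX_RANGES];
static int nr_reserved;

static bool overlaps(uint64_t start, uint64_t last, const struct range *r)
{
	return start <= r->last && r->start <= last;
}

/* Model of IOAS_MAP(IOMAP_RESERVE): mark [start, last] user-managed. */
static void iomap_reserve(uint64_t start, uint64_t last)
{
	assert(nr_reserved < MAX_RANGES);
	reserved[nr_reserved++] = (struct range){ start, last };
}

/* Model of IOAS_MAP: with fixed=true the caller's IOVA is used as-is,
 * even inside a reserved range; otherwise the "kernel" picks the lowest
 * IOVA that avoids every reserved range. */
static uint64_t iomap_map(uint64_t iova, uint64_t len, bool fixed)
{
	if (fixed)
		return iova;
	iova = 0;
	for (int i = 0; i < nr_reserved; i++)
		if (overlaps(iova, iova + len - 1, &reserved[i]))
			iova = reserved[i].last + 1;
	return iova;
}
```

The stricter variant mooted above (FIXED only *within* reserved
regions) would just add a range check to the fixed path.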
Which is a bit unfortunate.  I can't immediately see a way to tweak
this which handles both more neatly, but I like the idea if we can
figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> >
> > Yes, this is done already
> >
> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> > >    the guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with
> > >    the reserved IOVA regions)
> > > 4. Boot the guest

> I suppose the above is only the sample flow for a PPC vIOMMU.  For
> non-PPC vIOMMUs regular mappings are required before booting the guest,
> and reservation might be done but is not mandatory (at least not what
> current Qemu vfio can afford, as it simply replays valid ranges in the
> CPU address space).

That was a somewhat simplified description.  When we look in more
detail, I think the ppc and x86 models become more similar.  So, in
more detail, I think it would look like this:

1. Create base IOAS
2. Map guest memory into the base IOAS so that IOVA==GPA
3. Create IOASes for each vIOMMU domain
4. Reserve windows in the domain IOASes where the vIOMMU will allow
   mappings by default
5. ATTACH devices to appropriate IOASes (***)
6. Boot the guest

On guest map/invalidate:
    Use IOAS_COPY to take mappings from the base IOAS and put them
    into the domain IOAS
On memory hotplug:
    IOAS_MAP the new memory block into the base IOAS
On dev hotplug:
    (***) ATTACH devices to the appropriate IOAS
On guest reconfiguration of vIOMMU domains (x86 only):
    DETACH the device from the base IOAS, ATTACH it to the vIOMMU
    domain's IOAS
On guest reconfiguration of vIOMMU apertures (ppc only):
    Alter the reserved regions to match the vIOMMU

The difference between ppc and x86 is at the places marked (***):
which IOAS each device gets attached to, and when.  For x86 all
devices live in the base IOAS by default, and only get moved to domain
IOASes when those domains are set up in the vIOMMU.  For POWER each
device starts in a domain IOAS based on its guest PE, and never moves.

[This is still a bit simplified.  In practice, I imagine you'd
optimize to only create the domain IOASes at the point they're
needed - on boot for ppc, but only when the vIOMMU is configured for
x86.  I don't think that really changes the model, though.]

A few aspects of the model interact quite nicely here.  Mapping a
large-memory guest with IOVA==GPA would probably fail on a ppc host
IOMMU.  But if both guest and host are ppc, then no devices get
attached to that base IOAS, so its apertures don't get restricted by
the host hardware.  That way we get a common model, and the benefits
of GUP sharing via IOAS_COPY, without it failing in the ppc-on-ppc
case.

x86-on-ppc and ppc-on-x86 will probably only work in limited cases
where the various sizes and windows line up, but the possibility isn't
precluded by the model or interfaces.
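The GUP-sharing property of the guest map path can be modeled in a few
lines.  The structures and the ioas_map()/ioas_copy() helpers below are
invented for this sketch; the point is the design property claimed: the
base IOAS pins (GUPs) guest memory once at IOVA==GPA, and serving a
vIOMMU map with an IOAS_COPY-style lookup reuses that pin instead of
pinning the pages a second time:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MAPS 64

struct mapping { uint64_t iova, len; };
struct ioas { struct mapping maps[MAX_MAPS]; int nr; };

static int gup_calls;	/* how many times we "pinned" guest memory */

/* Model of IOAS_MAP into the base IOAS: pins the backing pages. */
static void ioas_map(struct ioas *ioas, uint64_t iova, uint64_t len)
{
	gup_calls++;
	ioas->maps[ioas->nr++] = (struct mapping){ iova, len };
}

/* Model of IOAS_COPY: look the range up by IOVA (here GPA) in @src and
 * install it in @dst at @dst_iova, sharing the existing pin. */
static int ioas_copy(struct ioas *dst, uint64_t dst_iova,
		     const struct ioas *src, uint64_t src_iova, uint64_t len)
{
	for (int i = 0; i < src->nr; i++) {
		const struct mapping *m = &src->maps[i];
		if (src_iova >= m->iova &&
		    src_iova + len <= m->iova + m->len) {
			/* Reuse the existing pin: no second GUP. */
			dst->maps[dst->nr++] =
				(struct mapping){ dst_iova, len };
			return 0;
		}
	}
	return -1;	/* GPA range not mapped in the base IOAS */
}
```

In the ppc-on-ppc case the base IOAS simply has no devices attached, so
the IOVA==GPA ioas_map() step never hits host aperture limits, while
ioas_copy() into the domain IOASes works unchanged.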
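The (***) placement rule can also be written down as a tiny state
machine.  The enum and function names are made up for illustration;
this just encodes the paragraph above: an x86 device starts in the base
IOAS and moves when the guest configures a vIOMMU domain for it, while
a ppc device is attached to its guest PE's domain IOAS from the start
and never moves:

```c
#include <assert.h>

enum arch { ARCH_X86, ARCH_PPC };

#define BASE_IOAS 0

struct device {
	enum arch arch;
	int pe_domain_ioas;	/* ppc only: domain IOAS of the guest PE */
	int ioas;		/* IOAS the device is attached to */
};

/* (***) Which IOAS does a freshly (hot)plugged device ATTACH to? */
static void attach_initial(struct device *dev)
{
	dev->ioas = dev->arch == ARCH_PPC ? dev->pe_domain_ioas : BASE_IOAS;
}

/* x86 only: the guest points the vIOMMU at a domain, so DETACH the
 * device from the base IOAS and ATTACH it to that domain's IOAS. */
static void on_viommu_domain_config(struct device *dev, int domain_ioas)
{
	if (dev->arch == ARCH_X86)
		dev->ioas = domain_ioas;
}
```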
> > > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part
> > >    of the reserved regions
> > > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with
> > >    the reserved regions)
> > > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> > >    necessary (which might fail)
> >
> > OK, I will take care of it
> >
> > Thanks,
> > Jason
>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson