On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, May 11, 2022 3:00 AM
> >
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplifies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> >
> > We already did both of these; the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel
> > will select the IOVA internally.
> >
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> >
> > Yes
> >
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other
> > >   (non-FIXED) MAPs won't use it, but doesn't actually put anything
> > >   into the IO pagetables.
> > >   - Like a regular mapping, ATTACHes that are incompatible with an
> > >     IOMAP_RESERVEed region will fail
> > >   - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >     mapping
> >
> > Yeah, this seems OK.  I'm thinking a new API might make sense, because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> >
> > IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> >
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
>
> As a permanent record, do we want to enforce that once the required
> range list is set, all FIXED and non-FIXED allocations must be within
> the list of ranges?

No, I don't think so.  In fact, the way I was envisaging this,
non-FIXED mappings will *never* go into the reserved ranges.  This is
for the benefit of any use cases that need both mappings where they
don't care about the IOVA and mappings where they do.

Essentially, reserving a region here is saying to the kernel "I want
to manage this IOVA space; make sure nothing else touches it".  That
means both that the kernel must disallow any hardware-associated
changes (like ATTACH) which would impinge on the reserved region, and
also any IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is
an interesting question.  I wasn't thinking that would be necessary
(just as you can use mmap() MAP_FIXED anywhere).  However, much as
MAP_FIXED is very dangerous to use if you don't previously reserve
address space, I think IOMAP_FIXED is dangerous if you haven't
previously reserved space.  So maybe it would make sense to only allow
FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel-managed and user-managed
regions does make a certain amount of sense.

> If yes, we can take the end of the last range as the max size of the
> iova address space to optimize the page table layout.
>
> Otherwise we may need another dedicated hint for that optimization.

Right.  With the revised model, where reserving windows is optional
rather than required, I don't think we can quite re-use this for
optimization hints.
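As an aside, the proposed semantics can be sketched as a toy userspace
model.  Everything here is invented for illustration (the
iomap_reserve()/iomap_map() names, the single flat aperture, the
lowest-address allocation policy); it is not the iommufd API.  It just
encodes the rule under discussion: non-FIXED allocations never land in
a reserved range, while a FIXED mapping may overmap one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RANGES 16

struct range { uint64_t start, last; };

/* Reserved ranges; the toy allocator assumes they are added in
 * ascending, non-overlapping order. */
static struct range reserved[MAX_RANGES];
static int nr_reserved;

static bool overlaps(uint64_t start, uint64_t last, const struct range *r)
{
	return start <= r->last && r->start <= last;
}

/* Model of IOAS_MAP(IOMAP_RESERVE): mark [start, last] user-managed. */
static void iomap_reserve(uint64_t start, uint64_t last)
{
	assert(nr_reserved < MAX_RANGES);
	reserved[nr_reserved++] = (struct range){ start, last };
}

/* Model of IOAS_MAP: with fixed=true the caller's IOVA is used as-is,
 * even inside a reserved range; otherwise the "kernel" picks the lowest
 * IOVA that avoids every reserved range. */
static uint64_t iomap_map(uint64_t iova, uint64_t len, bool fixed)
{
	if (fixed)
		return iova;
	iova = 0;
	for (int i = 0; i < nr_reserved; i++)
		if (overlaps(iova, iova + len - 1, &reserved[i]))
			iova = reserved[i].last + 1;
	return iova;
}
```

The stricter variant mooted above (FIXED only *within* reserved
regions) would just add a range check to the fixed path.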
Which is a bit unfortunate.  I can't immediately see a way to tweak
this which handles both more neatly, but I like the idea if we can
figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> >
> > Yes, this is done already
> >
> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> > >    the guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with
> > >    the reserved IOVA regions)
> > > 4. Boot the guest

> I suppose the above is only the sample flow for a PPC vIOMMU.  For
> non-PPC vIOMMUs regular mappings are required before booting the guest,
> and reservation might be done but is not mandatory (at least not what
> current Qemu vfio can afford, as it simply replays valid ranges in the
> CPU address space).

That was a somewhat simplified description.  When we look in more
detail, I think the ppc and x86 models become more similar.  So, in
more detail, I think it would look like this:

1. Create base IOAS
2. Map guest memory into the base IOAS so that IOVA==GPA
3. Create IOASes for each vIOMMU domain
4. Reserve windows in the domain IOASes where the vIOMMU will allow
   mappings by default
5. ATTACH devices to appropriate IOASes (***)
6. Boot the guest

On guest map/invalidate:
    Use IOAS_COPY to take mappings from the base IOAS and put them
    into the domain IOAS
On memory hotplug:
    IOAS_MAP the new memory block into the base IOAS
On dev hotplug:
    (***) ATTACH devices to the appropriate IOAS
On guest reconfiguration of vIOMMU domains (x86 only):
    DETACH the device from the base IOAS, ATTACH it to the vIOMMU
    domain's IOAS
On guest reconfiguration of vIOMMU apertures (ppc only):
    Alter the reserved regions to match the vIOMMU

The difference between ppc and x86 is at the places marked (***):
which IOAS each device gets attached to, and when.  For x86 all
devices live in the base IOAS by default, and only get moved to domain
IOASes when those domains are set up in the vIOMMU.  For POWER each
device starts in a domain IOAS based on its guest PE, and never moves.

[This is still a bit simplified.  In practice, I imagine you'd
optimize to only create the domain IOASes at the point they're
needed - on boot for ppc, but only when the vIOMMU is configured for
x86.  I don't think that really changes the model, though.]

A few aspects of the model interact quite nicely here.  Mapping a
large-memory guest with IOVA==GPA would probably fail on a ppc host
IOMMU.  But if both guest and host are ppc, then no devices get
attached to that base IOAS, so its apertures don't get restricted by
the host hardware.  That way we get a common model, and the benefits
of GUP sharing via IOAS_COPY, without it failing in the ppc-on-ppc
case.

x86-on-ppc and ppc-on-x86 will probably only work in limited cases
where the various sizes and windows line up, but the possibility isn't
precluded by the model or interfaces.
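The GUP-sharing property of the guest map path can be modeled in a few
lines.  The structures and the ioas_map()/ioas_copy() helpers below are
invented for this sketch; the point is the design property claimed: the
base IOAS pins (GUPs) guest memory once at IOVA==GPA, and serving a
vIOMMU map with an IOAS_COPY-style lookup reuses that pin instead of
pinning the pages a second time:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MAPS 64

struct mapping { uint64_t iova, len; };
struct ioas { struct mapping maps[MAX_MAPS]; int nr; };

static int gup_calls;	/* how many times we "pinned" guest memory */

/* Model of IOAS_MAP into the base IOAS: pins the backing pages. */
static void ioas_map(struct ioas *ioas, uint64_t iova, uint64_t len)
{
	gup_calls++;
	ioas->maps[ioas->nr++] = (struct mapping){ iova, len };
}

/* Model of IOAS_COPY: look the range up by IOVA (here GPA) in @src and
 * install it in @dst at @dst_iova, sharing the existing pin. */
static int ioas_copy(struct ioas *dst, uint64_t dst_iova,
		     const struct ioas *src, uint64_t src_iova, uint64_t len)
{
	for (int i = 0; i < src->nr; i++) {
		const struct mapping *m = &src->maps[i];
		if (src_iova >= m->iova &&
		    src_iova + len <= m->iova + m->len) {
			/* Reuse the existing pin: no second GUP. */
			dst->maps[dst->nr++] =
				(struct mapping){ dst_iova, len };
			return 0;
		}
	}
	return -1;	/* GPA range not mapped in the base IOAS */
}
```

In the ppc-on-ppc case the base IOAS simply has no devices attached, so
the IOVA==GPA ioas_map() step never hits host aperture limits, while
ioas_copy() into the domain IOASes works unchanged.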
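The (***) placement rule can also be written down as a tiny state
machine.  The enum and function names are made up for illustration;
this just encodes the paragraph above: an x86 device starts in the base
IOAS and moves when the guest configures a vIOMMU domain for it, while
a ppc device is attached to its guest PE's domain IOAS from the start
and never moves:

```c
#include <assert.h>

enum arch { ARCH_X86, ARCH_PPC };

#define BASE_IOAS 0

struct device {
	enum arch arch;
	int pe_domain_ioas;	/* ppc only: domain IOAS of the guest PE */
	int ioas;		/* IOAS the device is attached to */
};

/* (***) Which IOAS does a freshly (hot)plugged device ATTACH to? */
static void attach_initial(struct device *dev)
{
	dev->ioas = dev->arch == ARCH_PPC ? dev->pe_domain_ioas : BASE_IOAS;
}

/* x86 only: the guest points the vIOMMU at a domain, so DETACH the
 * device from the base IOAS and ATTACH it to that domain's IOAS. */
static void on_viommu_domain_config(struct device *dev, int domain_ioas)
{
	if (dev->arch == ARCH_X86)
		dev->ioas = domain_ioas;
}
```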
> > > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part
> > >    of the reserved regions
> > > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with
> > >    the reserved regions)
> > > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> > >    necessary (which might fail)
> >
> > OK, I will take care of it
> >
> > Thanks,
> > Jason
>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson