On Thu, Jun 03, 2021 at 07:17:23AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Wednesday, June 2, 2021 2:15 PM
> > > [...]
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> >
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extend to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implementing it any time soon).
>
> Yes, it's not very clear, and more in PCI context to simplify the
> description. A "physical" one here means a PCI endpoint function
> which has a unique RID. It's more to differentiate with later
> mdev/subdevice which uses both RID+PASID. Naming is always a hard
> exercise to me... Possibly I'll just use device vs. subdevice in future
> versions.
>
> >
> > Now, it's clear that we can't program things into the IOMMU before
> > attaching a device - we might not even know which IOMMU to use.
>
> yes
>
> > However, I'm not sure if it's wise to automatically make the AS "real"
> > as soon as we attach a device:
> >
> > * If we're going to attach a whole bunch of devices, could we (for at
> >   least some IOMMU models) end up doing a lot of work which then has
> >   to be re-done for each extra device we attach?
>
> which extra work did you specifically refer to? each attach just implies
> writing the base address of the I/O page table to the IOMMU structure
> corresponding to this device (either being a per-device entry, or per
> device+PASID entry).
>
> and generally device attach should not be in a hot path.
>
> >
> > * With kernel managed IO page tables could attaching a second device
> >   (at least on some IOMMU models) require some operation which would
> >   require discarding those tables? e.g.
> >   if the second device somehow forces a different IO page size
>
> Then the attach should fail and the user should create another IOASID
> for the second device.

Couldn't this make things weirdly order dependent though?  If device A
has strictly more capabilities than device B, then attaching A then B
will be fine, but B then A will trigger a new ioasid fd.

> > For that reason I wonder if we want some sort of explicit enable or
> > activate call.  Device attaches would only be valid before, map or
> > attach pagetable calls would only be valid after.
>
> I'm interested in learning a real example requiring explicit enable...
>
> >
> > > One I/O address space could be attached to multiple devices. In this case,
> > > /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> > >
> > > Based on the underlying IOMMU capability one device might be allowed
> > > to attach to multiple I/O address spaces, with DMAs accessing them by
> > > carrying different routing information. One of them is the default I/O
> > > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > > remaining are routed by RID + Process Address Space ID (PASID) or
> > > Stream+Substream ID. For simplicity the following context uses RID and
> > > PASID when talking about the routing information for I/O address spaces.
> >
> > I'm not really clear on how this interacts with nested ioasids.  Would
> > you generally expect the RID+PASID IOASes to be children of the base
> > RID IOAS, or not?
>
> No. With Intel SIOV both parent/children could be RID+PASID, e.g.
> when one enables vSVA on a mdev.

Hm, ok.  I really haven't understood how the PASIDs fit into this
then.  I'll try again on v2.

> > If the PASID ASes are children of the RID AS, can we consider this not
> > as the device explicitly attaching to multiple IOASIDs, but instead
> > attaching to the parent IOASID with awareness of the child ones?
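
To make the ordering concern above (attach A then B versus B then A)
concrete, here is a toy model.  It is not the proposed uAPI: `Device`,
`ToyIOAS`, `attach_in_order` and the "freeze the largest page size of
the first device" policy are all made up for illustration.  Which order
fails depends entirely on the policy assumed; the only point is that
the number of address spaces needed can differ with attach order.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Device:
    """Hypothetical device, described only by its usable IO page sizes."""
    name: str
    page_sizes: FrozenSet[int]  # in bytes


@dataclass
class ToyIOAS:
    """Toy address space whose page size is frozen by the first attach."""
    page_size: Optional[int] = None
    attached: List[str] = field(default_factory=list)

    def attach(self, dev: Device) -> bool:
        if self.page_size is None:
            # Assumed policy: build the kernel-managed tables with the
            # largest page size the first device supports.
            self.page_size = max(dev.page_sizes)
        elif self.page_size not in dev.page_sizes:
            # Incompatible with the existing tables: the attach fails and
            # the caller has to create another address space.
            return False
        self.attached.append(dev.name)
        return True


def attach_in_order(devices: List[Device]) -> int:
    """Attach devices in order; return how many address spaces were needed."""
    spaces: List[ToyIOAS] = []
    for dev in devices:
        if not any(space.attach(dev) for space in spaces):
            fresh = ToyIOAS()
            fresh.attach(dev)
            spaces.append(fresh)
    return len(spaces)


small = Device("small", frozenset({4096}))
large = Device("large", frozenset({4096, 2 * 1024 * 1024}))

print(attach_in_order([small, large]))  # 1
print(attach_in_order([large, small]))  # 2
```

With an explicit enable step, the page size could instead be chosen
once all devices are known, which would make both orders behave the
same in this toy.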
> >
> > > Device attachment is initiated through passthrough framework uAPI (use
> > > VFIO for simplicity in following context). VFIO is responsible for identifying
> > > the routing information and registering it to the ioasid driver when calling
> > > ioasid attach helper function. It could be RID if the assigned device is
> > > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > > user might also provide its view of virtual routing information (vPASID) in
> > > the attach call, e.g. when multiple user-managed I/O address spaces are
> > > attached to the vfio_device. In this case VFIO must figure out whether
> > > vPASID should be directly used (for pdev) or converted to a
> > > kernel-allocated one (pPASID, for mdev) for physical routing (see section 4).
> > >
> > > Device must be bound to an IOASID FD before attach operation can be
> > > conducted. This is also through VFIO uAPI. In this proposal one device
> > > should not be bound to multiple FDs. Not sure about the gain of
> > > allowing it except adding unnecessary complexity. But if others have
> > > different view we can further discuss.
> > >
> > > VFIO must ensure its device composes DMAs with the routing information
> > > attached to the IOASID. For pdev it naturally happens since vPASID is
> > > directly programmed to the device by guest software. For mdev this
> > > implies any guest operation carrying a vPASID on this device must be
> > > trapped into VFIO and then converted to pPASID before being sent to the
> > > device. A detailed explanation about PASID virtualization policies can be
> > > found in section 4.
> > >
> > > Modern devices may support a scalable workload submission interface
> > > based on PCI DMWr capability, allowing a single work queue to access
> > > multiple I/O address spaces. One example is Intel ENQCMD, having
> > > PASID saved in the CPU MSR and carried in the instruction payload
> > > when sent out to the device.
> > > Then a single work queue shared by multiple processes can compose
> > > DMAs carrying different PASIDs.
> >
> > Is the assumption here that the processes share the IOASID FD
> > instance, but not memory?
>
> I didn't get this question

Ok, stepping back, what exactly do you mean by "processes" above?  Do
you mean Linux processes, or something else?

> > > When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> > > which, if targeting a mdev, must be converted to pPASID before being sent
> > > to the wire. Intel CPU provides a hardware PASID translation capability
> > > for auto-conversion in the fast path. The user is expected to set up the
> > > PASID mapping through KVM uAPI, with information about {vpasid,
> > > ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> > > to figure out the actual pPASID given an IOASID.
> > >
> > > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > > It doesn't include any device routing information, which is only
> > > indirectly registered to the ioasid driver through VFIO uAPI. For
> > > example, I/O page fault is always reported to userspace per IOASID,
> > > although it's physically reported per device (RID+PASID). If there is a
> > > need of further relaying this fault into the guest, the user is responsible
> > > for identifying the device attached to this IOASID (randomly pick one if
> > > multiple attached devices) and then generating a per-device virtual I/O
> > > page fault into guest. Similarly the iotlb invalidation uAPI describes the
> > > granularity in the I/O address space (all, or a range), different from the
> > > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> > >
> > > I/O page tables routed through PASID are installed in a per-RID PASID
> > > table structure. Some platforms implement the PASID table in the guest
> > > physical space (GPA), expecting it managed by the guest.
> > > The guest PASID table is bound to the IOMMU also by attaching to an
> > > IOASID, representing the per-RID vPASID space.
> >
> > Do we need to consider two management modes here, much as we have for
> > the pagetables themselves: either kernel managed, in which we have
> > explicit calls to bind a vPASID to a parent PASID, or user managed in
> > which case we register a table in some format.
>
> yes, this is related to PASID virtualization in section 4. And based on
> suggestion from Jason, the vPASID requirement will be reported to
> user space via the per-device reporting interface.
>
> Thanks
> Kevin
>

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson