[RFC] /dev/ioasid uAPI proposal

* [RFC] /dev/ioasid uAPI proposal
@ 2021-05-27  7:58 ` Tian, Kevin
  0 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-05-27  7:58 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Tian, Kevin, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

/dev/ioasid provides an unified interface for managing I/O page tables for 
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
etc.) are expected to use this interface instead of creating their own logic to 
isolate untrusted device DMAs initiated by userspace. 

This proposal describes the uAPI of /dev/ioasid and also sample sequences 
with VFIO as example in typical usages. The driver-facing kernel API provided 
by the iommu layer is still TBD, which can be discussed after consensus is 
made on this uAPI.

It's based on a lengthy discussion starting from here:
	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 

It ends up to be a long writing due to many things to be summarized and
non-trivial effort required to connect them into a complete proposal.
Hope it provides a clean base to converge.

TOC
====
1. Terminologies and Concepts
2. uAPI Proposal
    2.1. /dev/ioasid uAPI
    2.2. /dev/vfio uAPI
    2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
    5.1. A simple example
    5.2. Multiple IOASIDs (no nesting)
    5.3. IOASID nesting (software)
    5.4. IOASID nesting (hardware)
    5.5. Guest SVA (vSVA)
    5.6. I/O page fault
    5.7. BIND_PASID_TABLE
====

1. Terminologies and Concepts
-----------------------------------------

IOASID FD is the container holding multiple I/O address spaces. User 
manages those address spaces through FD operations. Multiple FD's are 
allowed per process, but with this proposal one FD should be sufficient for 
all intended usages.

IOASID is the FD-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying the output address from one I/O page table 
(represented by child IOASID) must be further translated by another I/O 
page table (represented by parent IOASID).

I/O address space can be managed through two protocols, according to 
whether the corresponding I/O page table is constructed by the kernel or 
the user. When kernel-managed, a dma mapping protocol (similar to 
existing VFIO iommu type1) is provided for the user to explicitly specify 
how the I/O address space is mapped. Otherwise, a different protocol is 
provided for the user to bind an user-managed I/O page table to the 
IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
handling. 

Pgtable binding protocol can be used only on the child IOASID's, implying 
IOASID nesting must be enabled. This is because the kernel doesn't trust 
userspace. Nesting allows the kernel to enforce its DMA isolation policy 
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and 
software nesting. With hardware support the child and parent I/O page 
tables are walked consecutively by the IOMMU to form a nested translation. 
When it's implemented in software, the ioasid driver is responsible for 
merging the two-level mappings into a single-level shadow I/O page table. 
Software nesting requires both child/parent page tables operated through 
the dma mapping protocol, so any change in either level can be captured 
by the kernel to update the corresponding shadow mapping.

An I/O address space takes effect in the IOMMU only after it is attached 
to a device. The device in the /dev/ioasid context always refers to a 
physical one or 'pdev' (PF or VF). 

One I/O address space could be attached to multiple devices. In this case, 
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed 
to attach to multiple I/O address spaces, with DMAs accessing them by 
carrying different routing information. One of them is the default I/O 
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
remaining are routed by RID + Process Address Space ID (PASID) or 
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through passthrough framework uAPI (use
VFIO for simplicity in following context). VFIO is responsible for identifying 
the routing information and registering it to the ioasid driver when calling 
ioasid attach helper function. It could be RID if the assigned device is 
pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
user might also provide its view of virtual routing information (vPASID) in 
the attach call, e.g. when multiple user-managed I/O address spaces are 
attached to the vfio_device. In this case VFIO must figure out whether 
vPASID should be directly used (for pdev) or converted to a kernel-
allocated one (pPASID, for mdev) for physical routing (see section 4).

Device must be bound to an IOASID FD before attach operation can be
conducted. This is also through VFIO uAPI. In this proposal one device 
should not be bound to multiple FD's. Not sure about the gain of 
allowing it except adding unnecessary complexity. But if others have 
different view we can further discuss.

VFIO must ensure its device composes DMAs with the routing information
attached to the IOASID. For pdev it naturally happens since vPASID is 
directly programmed to the device by guest software. For mdev this 
implies any guest operation carrying a vPASID on this device must be 
trapped into VFIO and then converted to pPASID before sent to the 
device. A detail explanation about PASID virtualization policies can be 
found in section 4. 

Modern devices may support a scalable workload submission interface 
based on PCI DMWr capability, allowing a single work queue to access
multiple I/O address spaces. One example is Intel ENQCMD, having 
PASID saved in the CPU MSR and carried in the instruction payload 
when sent out to the device. Then a single work queue shared by 
multiple processes can compose DMAs carrying different PASIDs. 

When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
which, if targeting a mdev, must be converted to pPASID before sent
to the wire. Intel CPU provides a hardware PASID translation capability 
for auto-conversion in the fast path. The user is expected to setup the 
PASID mapping through KVM uAPI, with information about {vpasid, 
ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
to figure out the actual pPASID given an IOASID.

With above design /dev/ioasid uAPI is all about I/O address spaces. 
It doesn't include any device routing information, which is only 
indirectly registered to the ioasid driver through VFIO uAPI. For 
example, I/O page fault is always reported to userspace per IOASID, 
although it's physically reported per device (RID+PASID). If there is a 
need of further relaying this fault into the guest, the user is responsible 
of identifying the device attached to this IOASID (randomly pick one if 
multiple attached devices) and then generates a per-device virtual I/O 
page fault into guest. Similarly the iotlb invalidation uAPI describes the 
granularity in the I/O address space (all, or a range), different from the 
underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

I/O page tables routed through PASID are installed in a per-RID PASID 
table structure. Some platforms implement the PASID table in the guest 
physical space (GPA), expecting it managed by the guest. The guest
PASID table is bound to the IOMMU also by attaching to an IOASID, 
representing the per-RID vPASID space. 

We propose the host kernel needs to explicitly track  guest I/O page 
tables even on these platforms, i.e. the same pgtable binding protocol 
should be used universally on all platforms (with only difference on who 
actually writes the PASID table). One opinion from previous discussion 
was treating this special IOASID as a container for all guest I/O page 
tables i.e. hiding them from the host. However this way significantly 
violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
one address space any more. Device routing information (indirectly 
marking hidden I/O spaces) has to be carried in iotlb invalidation and 
page faulting uAPI to help connect vIOMMU with the underlying 
pIOMMU. This is one design choice to be confirmed with ARM guys.

Devices may sit behind IOMMU's with incompatible capabilities. The
difference may lie in the I/O page table format, or availability of an user
visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
checking the incompatibility between newly-attached device and existing
devices under the specific IOASID and, if found, returning error to user.
Upon such error the user should create a new IOASID for the incompatible
device. 

There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
device notation in this interface as aforementioned. But the ioasid driver 
does implicit check to make sure that devices within an iommu group 
must be all attached to the same IOASID before this IOASID starts to
accept any uAPI command. Otherwise error information is returned to 
the user.

There was a long debate in previous discussion whether VFIO should keep 
explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
a simplified model where every device bound to VFIO is explicitly listed 
under /dev/vfio thus a device fd can be acquired w/o going through legacy
container/group interface. In this case the user is responsible for 
understanding the group topology and meeting the implicit group check 
criteria enforced in /dev/ioasid. The use case examples in this proposal 
are based on the new model.

Of course for backward compatibility VFIO still needs to keep the existing 
uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
iommu ops to internal ioasid helper functions.

Notes:
-   It might be confusing as IOASID is also used in the kernel (drivers/
    iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
    find a better name later to differentiate.

-   PPC has not be considered yet as we haven't got time to fully understand
    its semantics. According to previous discussion there is some generality 
    between PPC window-based scheme and VFIO type1 semantics. Let's 
    first make consensus on this proposal and then further discuss how to 
    extend it to cover PPC's requirement.

-   There is a protocol between vfio group and kvm. Needs to think about
    how it will be affected following this proposal.

-   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
    which can be physically isolated in-between through PASID-granular
    IOMMU protection. Historically people also discussed one usage by 
    mediating a pdev into a mdev. This usage is not covered here, and is 
    supposed to be replaced by Max's work which allows overriding various 
    VFIO operations in vfio-pci driver.

2. uAPI Proposal
----------------------

/dev/ioasid uAPI covers everything about managing I/O address spaces.

/dev/vfio uAPI builds connection between devices and I/O address spaces.

/dev/kvm uAPI is optional required as far as ENQCMD is concerned.

2.1. /dev/ioasid uAPI
+++++++++++++++++

/*
  * Check whether an uAPI extension is supported. 
  *
  * This is for FD-level capabilities, such as locked page pre-registration. 
  * IOASID-level capabilities are reported through IOASID_GET_INFO.
  *
  * Return: 0 if not supported, 1 if supported.
  */
#define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)

/*
  * Register user space memory where DMA is allowed.
  *
  * It pins user pages and does the locked memory accounting so sub-
  * sequent IOASID_MAP/UNMAP_DMA calls get faster.
  *
  * When this ioctl is not used, one user page might be accounted
  * multiple times when it is mapped by multiple IOASIDs which are
  * not nested together.
  *
  * Input parameters:
  *	- vaddr;
  *	- size;
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
#define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)

/*
  * Allocate an IOASID. 
  *
  * IOASID is the FD-local software handle representing an I/O address 
  * space. Each IOASID is associated with a single I/O page table. User 
  * must call this ioctl to get an IOASID for every I/O address space that is
  * intended to be enabled in the IOMMU.
  *
  * A newly-created IOASID doesn't accept any command before it is 
  * attached to a device. Once attached, an empty I/O page table is 
  * bound with the IOMMU then the user could use either DMA mapping 
  * or pgtable binding commands to manage this I/O page table.
  *
  * Device attachment is initiated through device driver uAPI (e.g. VFIO)
  *
  * Return: allocated ioasid on success, -errno on failure.
  */
#define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
#define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)

/*
  * Get information about an I/O address space
  *
  * Supported capabilities:
  *	- VFIO type1 map/unmap;
  *	- pgtable/pasid_table binding
  *	- hardware nesting vs. software nesting;
  *	- ...
  *
  * Related attributes:
  * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
  *	- vendor pgtable formats (pgtable binding);
  *	- number of child IOASIDs (nesting);
  *	- ...
  *
  * Above information is available only after one or more devices are
  * attached to the specified IOASID. Otherwise the IOASID is just a
  * number w/o any capability or attribute.
  *
  * Input parameters:
  *	- u32 ioasid;
  *
  * Output parameters:
  *	- many. TBD.
  */
#define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)

/*
  * Map/unmap process virtual addresses to I/O virtual addresses.
  *
  * Provide VFIO type1 equivalent semantics. Start with the same 
  * restriction e.g. the unmap size should match those used in the 
  * original mapping call. 
  *
  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
  * must be already in the preregistered list.
  *
  * Input parameters:
  *	- u32 ioasid;
  *	- refer to vfio_iommu_type1_dma_{un}map
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
#define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)

/*
  * Create a nesting IOASID (child) on an existing IOASID (parent)
  *
  * IOASIDs can be nested together, implying that the output address 
  * from one I/O page table (child) must be further translated by 
  * another I/O page table (parent).
  *
  * As the child adds essentially another reference to the I/O page table 
  * represented by the parent, any device attached to the child ioasid 
  * must be already attached to the parent.
  *
  * In concept there is no limit on the number of the nesting levels. 
  * However for the majority case one nesting level is sufficient. The
  * user should check whether an IOASID supports nesting through 
  * IOASID_GET_INFO. For example, if only one nesting level is allowed,
  * the nesting capability is reported only on the parent instead of the
  * child.
  *
  * User also needs check (via IOASID_GET_INFO) whether the nesting 
  * is implemented in hardware or software. If software-based, DMA 
  * mapping protocol should be used on the child IOASID. Otherwise, 
  * the child should be operated with pgtable binding protocol.
  *
  * Input parameters:
  *	- u32 parent_ioasid;
  *
  * Return: child_ioasid on success, -errno on failure;
  */
#define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)

/*
  * Bind an user-managed I/O page table with the IOMMU
  *
  * Because user page table is untrusted, IOASID nesting must be enabled 
  * for this ioasid so the kernel can enforce its DMA isolation policy 
  * through the parent ioasid.
  *
  * Pgtable binding protocol is different from DMA mapping. The latter 
  * has the I/O page table constructed by the kernel and updated 
  * according to user MAP/UNMAP commands. With pgtable binding the 
  * whole page table is created and updated by userspace, thus different 
  * set of commands are required (bind, iotlb invalidation, page fault, etc.).
  *
  * Because the page table is directly walked by the IOMMU, the user 
  * must  use a format compatible to the underlying hardware. It can 
  * check the format information through IOASID_GET_INFO.
  *
  * The page table is bound to the IOMMU according to the routing 
  * information of each attached device under the specified IOASID. The
  * routing information (RID and optional PASID) is registered when a 
  * device is attached to this IOASID through VFIO uAPI. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of the user page table;
  *	- formats (vendor, address_width, etc.);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
#define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)

/*
  * Bind an user-managed PASID table to the IOMMU
  *
  * This is required for platforms which place PASID table in the GPA space.
  * In this case the specified IOASID represents the per-RID PASID space.
  *
  * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
  * special flag to indicate the difference from normal I/O address spaces.
  *
  * The format info of the PASID table is reported in IOASID_GET_INFO.
  *
  * As explained in the design section, user-managed I/O page tables must
  * be explicitly bound to the kernel even on these platforms. It allows
  * the kernel to uniformly manage I/O address spaces cross all platforms.
  * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
  * to carry device routing information to indirectly mark the hidden I/O
  * address spaces.
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of PASID table;
  *	- formats (vendor, size, etc.);
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
#define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)

/*
  * Invalidate IOTLB for an user-managed I/O page table
  *
  * Unlike what's defined in include/uapi/linux/iommu.h, this command 
  * doesn't allow the user to specify cache type and likely support only
  * two granularities (all, or a specified range) in the I/O address space.
  *
  * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
  * cache). If the IOASID represents an I/O address space, the invalidation
  * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
  * represents a vPASID space, then this command applies to the PASID
  * cache.
  *
  * Similarly this command doesn't provide IOMMU-like granularity
  * info (domain-wide, pasid-wide, range-based), since it's all about the
  * I/O address space itself. The ioasid driver walks the attached
  * routing information to match the IOMMU semantics under the
  * hood. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- granularity
  * 
  * Return: 0 on success, -errno on failure
  */
#define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)

/*
  * Page fault report and response
  *
  * This is TBD. Can be added after other parts are cleared up. Likely it 
  * will be a ring buffer shared between user/kernel, an eventfd to notify 
  * the user and an ioctl to complete the fault.
  *
  * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
  */

/*
  * Dirty page tracking 
  *
  * Track and report memory pages dirtied in I/O address spaces. There 
  * is an ongoing work by Kunkun Jiang by extending existing VFIO type1. 
  * It needs be adapted to /dev/ioasid later.
  */

2.2. /dev/vfio uAPI
++++++++++++++++

/*
  * Bind a vfio_device to the specified IOASID fd
  *
  * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
  * vfio device should not be bound to multiple ioasid_fd's. 
  *
  * Input parameters:
  *	- ioasid_fd;
  *
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
#define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)

/*
  * Attach a vfio device to the specified IOASID
  *
  * Multiple vfio devices can be attached to the same IOASID, and vice 
  * versa. 
  *
  * User may optionally provide a "virtual PASID" to mark an I/O page 
  * table on this vfio device. Whether the virtual PASID is physically used 
  * or converted to another kernel-allocated PASID is a policy in vfio device 
  * driver.
  *
  * There is no need to specify ioasid_fd in this call due to the assumption 
  * of 1:1 connection between vfio device and the bound fd.
  *
  * Input parameter:
  *	- ioasid;
  *	- flag;
  *	- user_pasid (if specified);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
#define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)

2.3. KVM uAPI
++++++++++++

/*
  * Update CPU PASID mapping
  *
  * This is necessary when ENQCMD will be used in the guest while the
  * targeted device doesn't accept the vPASID saved in the CPU MSR.
  *
  * This command allows user to set/clear the vPASID->pPASID mapping
  * in the CPU, by providing the IOASID (and FD) information representing
  * the I/O address space marked by this vPASID.
  *
  * Input parameters:
  *	- user_pasid;
  *	- ioasid_fd;
  *	- ioasid;
  */
#define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)

3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOASID_FD:

	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
	int ioasid_unregister_device(struct ioasid_dev *dev);

An ioasid_ctx is created for each fd:

	struct ioasid_ctx {
		// a list of allocated IOASID data's
		struct list_head		ioasid_list;
		// a list of registered devices
		struct list_head		dev_list;
		// a list of pre-registered virtual address ranges
		struct list_head		prereg_list;
	};

Each registered device is represented by ioasid_dev:

	struct ioasid_dev {
		struct list_head		next;
		struct ioasid_ctx	*ctx;
		// always be the physical device
		struct device 		*device;
		struct kref		kref;
	};

Because we assume one vfio_device connected to at most one ioasid_fd, 
here ioasid_dev could be embedded in vfio_device and then linked to 
ioasid_ctx->dev_list when registration succeeds. For mdev the struct
device should be the pointer to the parent device. PASID marking this
mdev is specified later when VFIO_ATTACH_IOASID.

An ioasid_data is created when IOASID_ALLOC, as the main object 
describing characteristics about an I/O page table:

	struct ioasid_data {
		// link to ioasid_ctx->ioasid_list
		struct list_head		next;

		// the IOASID number
		u32			ioasid;

		// the handle to convey iommu operations
		// hold the pgd (TBD until discussing iommu api)
		struct iommu_domain *domain;

		// map metadata (vfio type1 semantics)
		struct rb_node		dma_list;

		// pointer to user-managed pgtable (for nesting case)
		u64			user_pgd;

		// link to the parent ioasid (for nesting)
		struct ioasid_data	*parent;

		// cache the global PASID shared by ENQCMD-capable
		// devices (see below explanation in section 4)
		u32			pasid;

		// a list of device attach data (routing information)
		struct list_head		attach_data;

		// a list of partially-attached devices (group)
		struct list_head		partial_devices;

		// a list of fault_data reported from the iommu layer
		struct list_head		fault_data;

		...
	}

ioasid_data and iommu_domain have overlapping roles as both are 
introduced to represent an I/O address space. It is still a big TBD how 
the two should be corelated or even merged, and whether new iommu 
ops are required to handle RID+PASID explicitly. We leave this as open 
for now as this proposal is mainly about uAPI. For simplification 
purpose the two objects are kept separate in this context, assuming an 
1:1 connection in-between and the domain as the place-holder 
representing the 1st class object in the iommu ops. 

Two helper functions are provided to support VFIO_ATTACH_IOASID:

	struct attach_info {
		u32	ioasid;
		// If valid, the PASID to be used physically
		u32	pasid;
	};
	int ioasid_device_attach(struct ioasid_dev *dev, 
		struct attach_info info);
	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

The pasid parameter is optionally provided based on the policy in vfio
device driver. It could be the PASID marking the default I/O address 
space for a mdev, or the user-provided PASID marking an user I/O page
table, or another kernel-allocated PASID backing the user-provided one.
Please check next section for detail explanation.

A new object is introduced and linked to ioasid_data->attach_data for 
each successful attach operation:

	struct ioasid_attach_data {
		struct list_head		next;
		struct ioasid_dev	*dev;
		u32 			pasid;
	}

As explained in the design section, there is no explicit group enforcement
in /dev/ioasid uAPI or helper functions. But the ioasid driver does
implicit group check - before every device within an iommu group is 
attached to this IOASID, the previously-attached devices in this group are
put in ioasid_data->partial_devices. The IOASID rejects any command if
the partial_devices list is not empty.

Then is the last helper function:
	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
		u32 ioasid, bool alloc);

ioasid_get_global_pasid is necessary in scenarios where multiple devices 
want to share a same PASID value on the attached I/O page table (e.g. 
when ENQCMD is enabled, as explained in next section). We need a 
centralized place (ioasid_data->pasid) to hold this value (allocated when
first called with alloc=true). vfio device driver calls this function (alloc=
true) to get the global PASID for an ioasid before calling ioasid_device_
attach. KVM also calls this function (alloc=false) to setup PASID translation 
structure when user calls KVM_MAP_PASID.

4. PASID Virtualization
------------------------------

When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
created on the assigned vfio device. This leads to the concepts of 
"virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned 
by the guest to mark an GVA address space while pPASID is the one 
selected by the host and actually routed in the wire.

vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

vfio device driver translates vPASID to pPASID before calling ioasid_attach_
device, with two factors to be considered:

-    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
     should be instead converted to a newly-allocated one (vPASID!=
     pPASID);

-    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
     space or a global PASID space (implying sharing pPASID cross devices,
     e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
     as part of the process context);

The actual policy depends on pdev vs. mdev, and whether ENQCMD is
supported. There are three possible scenarios:

(Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
policies.)

1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID

     vPASIDs are directly programmed by the guest to the assigned MMIO 
     bar, implying all DMAs out of this device having vPASID in the packet 
     header. This mandates vPASID==pPASID, sort of delegating the entire 
     per-RID PASID space to the guest.

     When ENQCMD is enabled, the CPU MSR when running a guest task
     contains a vPASID. In this case the CPU PASID translation capability 
     should be disabled so this vPASID in CPU MSR is directly sent to the
     wire.

     This ensures consistent vPASID usage on pdev regardless of the 
     workload submitted through a MMIO register or ENQCMD instruction.

2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)

     PASIDs are also used by kernel to mark the default I/O address space 
     for mdev, thus cannot be delegated to the guest. Instead, the mdev 
     driver must allocate a new pPASID for each vPASID (thus vPASID!=
     pPASID) and then use pPASID when attaching this mdev to an ioasid.

     The mdev driver needs cache the PASID mapping so in mediation 
     path vPASID programmed by the guest can be converted to pPASID 
     before updating the physical MMIO register. The mapping should
     also be saved in the CPU PASID translation structure (via KVM uAPI), 
     so the vPASID saved in the CPU MSR is auto-translated to pPASID 
     before sent to the wire, when ENQCMD is enabled. 

     Generally pPASID could be allocated from the per-RID PASID space
     if all mdev's created on the parent device don't support ENQCMD.

     However if the parent supports ENQCMD-capable mdev, pPASIDs
     must be allocated from a global pool because the CPU PASID 
     translation structure is per-VM. It implies that when an guest I/O 
     page table is attached to two mdevs with a single vPASID (i.e. bind 
     to the same guest process), a same pPASID should be used for 
     both mdevs even when they belong to different parents. Sharing
     pPASID cross mdevs is achieved by calling aforementioned ioasid_
     get_global_pasid().

3)  Mix pdev/mdev together

     Above policies are per device type thus are not affected when mixing 
     those device types together (when assigned to a single guest). However, 
     there is one exception - when both pdev/mdev support ENQCMD.

     Remember the two types have conflicting requirements on whether 
     CPU PASID translation should be enabled. This capability is per-VM, 
     and must be enabled for mdev isolation. When enabled, pdev will 
     receive a mdev pPASID violating its vPASID expectation.

     In previous thread a PASID range split scheme was discussed to support
     this combination, but we haven't worked out a clean uAPI design yet.
     Therefore in this proposal we decide to not support it, implying the 
     user should have some intelligence to avoid such scenario. It could be
     a TODO task for future.

In spite of those subtle considerations, the kernel implementation could
start simple, e.g.:

-    v==p for pdev;
-    v!=p and always use a global PASID pool for all mdev's;

Regardless of the kernel policy, the user policy is unchanged:

-    provide vPASID when calling VFIO_ATTACH_IOASID;
-    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
-    Don't expose ENQCMD capability on both pdev and mdev;

Sample user flow is described in section 5.5.

5. Use Cases and Flows
-------------------------------

Here assume VFIO will support a new model where every bound device
is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
going through legacy container/group interface. For illustration purpose
those devices are just called dev[1...N]:

	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

As explained earlier, one IOASID fd is sufficient for all intended use cases:

	ioasid_fd = open("/dev/ioasid", mode);

For simplicity below examples are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.

Three types of IOASIDs are considered:

	gpa_ioasid[1...N]: 	for GPA address space
	giova_ioasid[1...N]:	for guest IOVA address space
	gva_ioasid[1...N]:	for guest CPU VA address space

At least one gpa_ioasid must always be created per guest, while the other 
two are relevant as far as vIOMMU is concerned.

Examples here apply to both pdev and mdev, if not explicitly marked out
(e.g. in section 5.5). VFIO device driver in the kernel will figure out the 
associated routing information in the attaching operation.

For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
INFO are skipped in these examples.

5.1. A simple example
++++++++++++++++++

Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
space is managed through DMA mapping protocol:

	/* Bind device to IOASID fd */
	device_fd = open("/dev/vfio/devices/dev1", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* Attach device to IOASID */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

If the guest is assigned with more than dev1, user follows above sequence
to attach other devices to the same gpa_ioasid i.e. sharing the GPA 
address space cross all assigned devices.

5.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates 
an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
through mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu need to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today).

To avoid duplicated locked page accounting, it's recommended to pre-
register the virtual address range that will be used for DMA:

	device_fd1 = open("/dev/vfio/devices/dev1", mode);
	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* pre-register the virtual address range for accounting */
	mem_info = { .vaddr = 0x40000000; .size = 1GB };
	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);

	/* Attach dev1 and dev2 to gpa_ioasid */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0; 		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

	/* After boot, guest enables an GIOVA space for dev2 */
	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	/* First detach dev2 from previous address space */
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

	/* Then attach dev2 to the new address space */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a shadow DMA mapping according to vIOMMU
	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000; 	// GIOVA
		.vaddr	= 0x40001000;	// HVA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

5.3. IOASID nesting (software)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with software-based IOASID nesting 
available. In this mode it is the kernel instead of user to create the
shadow mapping.

The flow before guest boots is same as 5.2, except one point. Because 
giova_ioasid is nested on gpa_ioasid, locked accounting is only 
conducted for gpa_ioasid. So it's not necessary to pre-register virtual 
memory.

To save space we only list the steps after boots (i.e. both dev1/dev2
have been attached to gpa_ioasid before guest boots):

	/* After boots */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
	  * to form a shadow mapping.
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x1000;	// GPA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

5.4. IOASID nesting (hardware)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with hardware-based IOASID nesting
available. In this mode the pgtable binding protocol is used to 
bind the guest IOVA page table with the IOMMU:

	/* After boots */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= giova_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular information
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

	/* See 5.6 for I/O page fault handling */

5.5. Guest SVA (vSVA)
++++++++++++++++++

After boots the guest further create a GVA address spaces (gpasid1) on 
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, user should avoid expose ENQCMD on both
pdev and mdev.

The sequence applies to all device types (being pdev or mdev), except
one additional step to call KVM for ENQCMD-capable mdev:

	/* After boots */
	/* Make GVA space nested on GPA space */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space and specify vPASID */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
	  * translation structure through KVM
	  */
	pa_data = {
		.ioasid_fd	= ioasid_fd;
		.ioasid		= gva_ioasid;
		.guest_pasid	= gpasid1;
	};
	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...

5.6. I/O page fault
+++++++++++++++

(uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
to guest IOMMU driver and backwards).

-   Host IOMMU driver receives a page request with raw fault_data {rid, 
    pasid, addr};

-   Host IOMMU driver identifies the faulting I/O page table according to
    information registered by IOASID fault handler;

-   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
    is saved in ioasid_data->fault_data (used for response);

-   IOASID fault handler generates an user fault_data (ioasid, addr), links it 
    to the shared ring buffer and triggers eventfd to userspace;

-   Upon received event, Qemu needs to find the virtual routing information 
    (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
    multiple, pick a random one. This should be fine since the purpose is to
    fix the I/O page table on the guest;

-   Qemu generates a virtual I/O page fault through vIOMMU into guest,
    carrying the virtual fault data (v_rid, v_pasid, addr);

-   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
    then sends a page response with virtual completion data (v_rid, v_pasid, 
    response_code) to vIOMMU;

-   Qemu finds the pending fault event, converts virtual completion data 
    into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
    complete the pending fault;

-   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
    ioasid_data->fault_data, and then calls iommu api to complete it with
    {rid, pasid, response_code};

5.7. BIND_PASID_TABLE
++++++++++++++++++++

PASID table is put in the GPA space on some platform, thus must be updated
by the guest. It is treated as another user page table to be bound with the 
IOMMU.

As explained earlier, the user still needs to explicitly bind every user I/O 
page table to the kernel so the same pgtable binding protocol (bind, cache 
invalidate and fault handling) is unified cross platforms.

vIOMMUs may include a caching mode (or paravirtualized way) which, once 
enabled, requires the guest to invalidate PASID cache for any change on the 
PASID table. This allows Qemu to track the lifespan of guest I/O page tables.

In case of missing such capability, Qemu could enable write-protection on
the guest PASID table to achieve the same effect.

	/* After boots */
	/* Make vPASID space nested on GPA space */
	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to pasidtbl_ioasid */
	at_data = { .ioasid = pasidtbl_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind PASID table */
	bind_data = {
		.ioasid	= pasidtbl_ioasid;
		.addr	= gpa_pasid_table;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);

	/* vIOMMU detects a new GVA I/O space created */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space, with gpasid1 */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
	  * used, the kernel will not update the PASID table. Instead, just
	  * track the bound I/O page table for handling invalidation and
	  * I/O page faults.
	  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread