[RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin @ 2021-05-27  7:58 UTC
  To: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Tian, Kevin, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

/dev/ioasid provides a unified interface for managing I/O page tables for 
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
etc.) are expected to use this interface instead of creating their own logic to 
isolate untrusted device DMAs initiated by userspace. 

This proposal describes the uAPI of /dev/ioasid along with sample sequences 
using VFIO as the example for typical usages. The driver-facing kernel API 
provided by the iommu layer is still TBD and can be discussed after consensus 
is reached on this uAPI.

It's based on a lengthy discussion starting from here:
	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 

It ends up being a long write-up because many things needed to be summarized
and non-trivial effort was required to connect them into a complete proposal.
Hopefully it provides a clean base to converge on.

TOC
====
1. Terminologies and Concepts
2. uAPI Proposal
    2.1. /dev/ioasid uAPI
    2.2. /dev/vfio uAPI
    2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
    5.1. A simple example
    5.2. Multiple IOASIDs (no nesting)
    5.3. IOASID nesting (software)
    5.4. IOASID nesting (hardware)
    5.5. Guest SVA (vSVA)
    5.6. I/O page fault
    5.7. BIND_PASID_TABLE
====

1. Terminologies and Concepts
-----------------------------------------

The IOASID FD is the container holding multiple I/O address spaces. The user 
manages those address spaces through FD operations. Multiple FDs are 
allowed per process, but with this proposal one FD should be sufficient for 
all intended usages.

IOASID is the FD-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying that the output address from one I/O page table 
(represented by a child IOASID) must be further translated by another I/O 
page table (represented by the parent IOASID).

An I/O address space can be managed through two protocols, according to 
whether the corresponding I/O page table is constructed by the kernel or 
the user. When kernel-managed, a DMA mapping protocol (similar to 
existing VFIO iommu type1) is provided for the user to explicitly specify 
how the I/O address space is mapped. Otherwise, a different protocol is 
provided for the user to bind a user-managed I/O page table to the 
IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
handling. 

The pgtable binding protocol can be used only on child IOASIDs, implying 
IOASID nesting must be enabled. This is because the kernel doesn't trust 
userspace. Nesting allows the kernel to enforce its DMA isolation policy 
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and 
software nesting. With hardware support the child and parent I/O page 
tables are walked consecutively by the IOMMU to form a nested translation. 
When it's implemented in software, the ioasid driver is responsible for 
merging the two-level mappings into a single-level shadow I/O page table. 
Software nesting requires both the child and parent page tables to be 
operated through the DMA mapping protocol, so any change in either level 
can be captured by the kernel to update the corresponding shadow mapping.

An I/O address space takes effect in the IOMMU only after it is attached 
to a device. The device in the /dev/ioasid context always refers to a 
physical one or 'pdev' (PF or VF). 

One I/O address space could be attached to multiple devices. In this case, 
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed 
to attach to multiple I/O address spaces, with DMAs accessing them by 
carrying different routing information. One of them is the default I/O 
address space routed by the PCI Requester ID (RID) or ARM Stream ID. The 
rest are routed by RID + Process Address Space ID (PASID) or 
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through the passthrough framework uAPI (VFIO
is used for simplicity in the following context). VFIO is responsible for 
identifying the routing information and registering it with the ioasid driver 
when calling the ioasid attach helper function. It could be RID if the assigned 
device is a pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In 
addition, the user might also provide its view of virtual routing information 
(vPASID) in the attach call, e.g. when multiple user-managed I/O address 
spaces are attached to the vfio_device. In this case VFIO must figure out 
whether vPASID should be used directly (for pdev) or converted to a kernel-
allocated one (pPASID, for mdev) for physical routing (see section 4).

A device must be bound to an IOASID FD before the attach operation can be
conducted. This is also done through the VFIO uAPI. In this proposal one 
device should not be bound to multiple FDs; we are not sure what is gained 
by allowing it except adding unnecessary complexity, but if others have a 
different view we can discuss further.

VFIO must ensure its device composes DMAs with the routing information
attached to the IOASID. For pdev it naturally happens since the vPASID is 
directly programmed into the device by guest software. For mdev this 
implies any guest operation carrying a vPASID on this device must be 
trapped into VFIO and then converted to a pPASID before being sent to the 
device. A detailed explanation of PASID virtualization policies can be 
found in section 4. 

Modern devices may support a scalable workload submission interface 
based on the PCI DMWr capability, allowing a single work queue to access
multiple I/O address spaces. One example is Intel ENQCMD, which has the 
PASID saved in a CPU MSR and carried in the instruction payload 
when sent out to the device. Then a single work queue shared by 
multiple processes can compose DMAs carrying different PASIDs. 

When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
which, if targeting an mdev, must be converted to a pPASID before being 
sent on the wire. Intel CPUs provide a hardware PASID translation capability 
for auto-conversion in the fast path. The user is expected to set up the 
PASID mapping through the KVM uAPI, with information about {vpasid, 
ioasid_fd, ioasid}. The ioasid driver provides a helper function for KVM 
to figure out the actual pPASID given an IOASID.

With the above design the /dev/ioasid uAPI is all about I/O address spaces. 
It doesn't include any device routing information, which is only 
indirectly registered with the ioasid driver through the VFIO uAPI. For 
example, an I/O page fault is always reported to userspace per IOASID, 
although it's physically reported per device (RID+PASID). If there is a 
need to further relay this fault into the guest, the user is responsible 
for identifying the device attached to this IOASID (randomly picking one if 
multiple devices are attached) and then generating a per-device virtual I/O 
page fault into the guest. Similarly the iotlb invalidation uAPI describes the 
granularity in the I/O address space (all, or a range), different from the 
underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

I/O page tables routed through PASID are installed in a per-RID PASID 
table structure. Some platforms implement the PASID table in the guest 
physical address (GPA) space, expecting it to be managed by the guest. The 
guest PASID table is bound to the IOMMU also by attaching to an IOASID, 
representing the per-RID vPASID space. 

We propose that the host kernel explicitly track guest I/O page 
tables even on these platforms, i.e. the same pgtable binding protocol 
should be used universally on all platforms (with the only difference being 
who actually writes the PASID table). One opinion from the previous 
discussion was to treat this special IOASID as a container for all guest I/O 
page tables, i.e. hiding them from the host. However this significantly 
violates the philosophy of this /dev/ioasid proposal: it is no longer one 
IOASID per address space. Device routing information (indirectly 
marking hidden I/O spaces) would have to be carried in the iotlb invalidation 
and page faulting uAPI to help connect the vIOMMU with the underlying 
pIOMMU. This is one design choice to be confirmed with the ARM folks.

Devices may sit behind IOMMUs with incompatible capabilities. The
difference may lie in the I/O page table format, or the availability of a 
user-visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
checking compatibility between a newly-attached device and the existing
devices under the specific IOASID and, if an incompatibility is found, 
returning an error to the user. Upon such an error the user should create 
a new IOASID for the incompatible device. 

There is no explicit group enforcement in the /dev/ioasid uAPI, since this 
interface has no device notion as aforementioned. But the ioasid driver 
does an implicit check to make sure that all devices within an iommu group 
are attached to the same IOASID before this IOASID starts to accept any 
uAPI command. Otherwise an error is returned to the user.

There was a long debate in the previous discussion about whether VFIO should 
keep explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
a simplified model where every device bound to VFIO is explicitly listed 
under /dev/vfio, thus a device fd can be acquired w/o going through the legacy
container/group interface. In this case the user is responsible for 
understanding the group topology and meeting the implicit group check 
criteria enforced in /dev/ioasid. The use case examples in this proposal 
are based on the new model.

Of course for backward compatibility VFIO still needs to keep the existing 
uAPI, and vfio iommu type1 will become a shim layer connecting VFIO 
iommu ops to the internal ioasid helper functions.

Notes:
-   It might be confusing that IOASID is also used in the kernel (drivers/
    iommu/ioasid.c) to represent the PCI PASID or ARM substream ID. We need
    to find a better name later to differentiate.

-   PPC has not been considered yet as we haven't had time to fully understand
    its semantics. According to the previous discussion there is some generality 
    between the PPC window-based scheme and VFIO type1 semantics. Let's 
    first reach consensus on this proposal and then further discuss how to 
    extend it to cover PPC's requirements.

-   There is a protocol between the vfio group and kvm. We need to think about
    how it will be affected by this proposal.

-   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
    which can be physically isolated from each other through PASID-granular
    IOMMU protection. Historically people also discussed one usage of 
    mediating a pdev into an mdev. That usage is not covered here, and is 
    supposed to be replaced by Max's work which allows overriding various 
    VFIO operations in the vfio-pci driver.

2. uAPI Proposal
----------------------

/dev/ioasid uAPI covers everything about managing I/O address spaces.

/dev/vfio uAPI builds connection between devices and I/O address spaces.

/dev/kvm uAPI is optional, required only as far as ENQCMD is concerned.


2.1. /dev/ioasid uAPI
+++++++++++++++++

/*
  * Check whether a uAPI extension is supported. 
  *
  * This is for FD-level capabilities, such as locked page pre-registration. 
  * IOASID-level capabilities are reported through IOASID_GET_INFO.
  *
  * Return: 0 if not supported, 1 if supported.
  */
#define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
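
As a usage sketch (the extension identifier below is purely illustrative and
not part of this proposal), the user would probe an FD-level capability
before relying on it:

	/* hypothetical example: probe locked page pre-registration support */
	ret = ioctl(ioasid_fd, IOASID_CHECK_EXTENSION, IOASID_EXT_PREREG);
	if (ret == 1)
		use_prereg = true;	/* IOASID_REGISTER_MEMORY is available */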


/*
  * Register user space memory where DMA is allowed.
  *
  * It pins user pages and does the locked memory accounting so
  * subsequent IOASID_MAP/UNMAP_DMA calls get faster.
  *
  * When this ioctl is not used, one user page might be accounted
  * multiple times when it is mapped by multiple IOASIDs which are
  * not nested together.
  *
  * Input parameters:
  *	- vaddr;
  *	- size;
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
#define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)


/*
  * Allocate an IOASID. 
  *
  * IOASID is the FD-local software handle representing an I/O address 
  * space. Each IOASID is associated with a single I/O page table. User 
  * must call this ioctl to get an IOASID for every I/O address space that is
  * intended to be enabled in the IOMMU.
  *
  * A newly-created IOASID doesn't accept any command before it is 
  * attached to a device. Once attached, an empty I/O page table is 
  * bound to the IOMMU; then the user can use either DMA mapping 
  * or pgtable binding commands to manage this I/O page table.
  *
  * Device attachment is initiated through device driver uAPI (e.g. VFIO)
  *
  * Return: allocated ioasid on success, -errno on failure.
  */
#define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
#define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)


/*
  * Get information about an I/O address space
  *
  * Supported capabilities:
  *	- VFIO type1 map/unmap;
  *	- pgtable/pasid_table binding
  *	- hardware nesting vs. software nesting;
  *	- ...
  *
  * Related attributes:
  * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
  *	- vendor pgtable formats (pgtable binding);
  *	- number of child IOASIDs (nesting);
  *	- ...
  *
  * Above information is available only after one or more devices are
  * attached to the specified IOASID. Otherwise the IOASID is just a
  * number w/o any capability or attribute.
  *
  * Input parameters:
  *	- u32 ioasid;
  *
  * Output parameters:
  *	- many. TBD.
  */
#define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
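
To make the TBD output a bit more concrete, below is a rough sketch of what
the output structure could carry; every field name is illustrative and open
to discussion:

	/* illustrative sketch only, not a committed layout */
	struct ioasid_info {
		u32	ioasid;
		u32	flags;		// map/unmap, pgtable/pasid_table bind,
					// hw vs. sw nesting, ...
		u64	pgsize_bitmap;	// supported page sizes (DMA mapping)
		u64	pgtable_formats; // vendor pgtable formats (binding)
		u32	max_nesting;	// number of child IOASIDs allowed
		// reserved IOVA ranges could be reported via a chained buffer
	};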


/*
  * Map/unmap process virtual addresses to I/O virtual addresses.
  *
  * Provide VFIO type1 equivalent semantics. Start with the same 
  * restrictions, e.g. the unmap size should match the one used in the 
  * original mapping call. 
  *
  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
  * must be already in the preregistered list.
  *
  * Input parameters:
  *	- u32 ioasid;
  *	- refer to vfio_iommu_type1_dma_{un}map
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
#define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
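
Borrowing from struct vfio_iommu_type1_dma_map, the input could roughly look
like the sketch below (field names are illustrative, not a committed layout):

	/* illustrative sketch mirroring vfio_iommu_type1_dma_map */
	struct ioasid_dma_map {
		u32	ioasid;
		u32	flags;		// read/write permission
		u64	vaddr;		// process virtual address
		u64	iova;		// I/O virtual address
		u64	size;		// must match on unmap (type1 semantics)
	};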


/*
  * Create a nesting IOASID (child) on an existing IOASID (parent)
  *
  * IOASIDs can be nested together, implying that the output address 
  * from one I/O page table (child) must be further translated by 
  * another I/O page table (parent).
  *
  * As the child essentially adds another reference to the I/O page table 
  * represented by the parent, any device attached to the child ioasid 
  * must already be attached to the parent.
  *
  * In concept there is no limit on the number of the nesting levels. 
  * However for the majority case one nesting level is sufficient. The
  * user should check whether an IOASID supports nesting through 
  * IOASID_GET_INFO. For example, if only one nesting level is allowed,
  * the nesting capability is reported only on the parent instead of the
  * child.
  *
  * The user also needs to check (via IOASID_GET_INFO) whether the nesting 
  * is implemented in hardware or software. If software-based, the DMA 
  * mapping protocol should be used on the child IOASID. Otherwise, 
  * the child should be operated with the pgtable binding protocol.
  *
  * Input parameters:
  *	- u32 parent_ioasid;
  *
  * Return: child_ioasid on success, -errno on failure;
  */
#define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)


/*
  * Bind a user-managed I/O page table to the IOMMU
  *
  * Because the user page table is untrusted, IOASID nesting must be enabled 
  * for this ioasid so the kernel can enforce its DMA isolation policy 
  * through the parent ioasid.
  *
  * The pgtable binding protocol is different from DMA mapping. The latter 
  * has the I/O page table constructed by the kernel and updated 
  * according to user MAP/UNMAP commands. With pgtable binding the 
  * whole page table is created and updated by userspace, thus a different 
  * set of commands is required (bind, iotlb invalidation, page fault, etc.).
  *
  * Because the page table is directly walked by the IOMMU, the user 
  * must use a format compatible with the underlying hardware. It can 
  * check the format information through IOASID_GET_INFO.
  *
  * The page table is bound to the IOMMU according to the routing 
  * information of each attached device under the specified IOASID. The
  * routing information (RID and optional PASID) is registered when a 
  * device is attached to this IOASID through VFIO uAPI. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of the user page table;
  *	- formats (vendor, address_width, etc.);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
#define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)


/*
  * Bind a user-managed PASID table to the IOMMU
  *
  * This is required for platforms which place PASID table in the GPA space.
  * In this case the specified IOASID represents the per-RID PASID space.
  *
  * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
  * special flag to indicate the difference from normal I/O address spaces.
  *
  * The format info of the PASID table is reported in IOASID_GET_INFO.
  *
  * As explained in the design section, user-managed I/O page tables must
  * be explicitly bound to the kernel even on these platforms. This allows
  * the kernel to uniformly manage I/O address spaces across all platforms.
  * Otherwise, the iotlb invalidation and page faulting uAPI would have to be
  * hacked to carry device routing information to indirectly mark the hidden
  * I/O address spaces.
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of PASID table;
  *	- formats (vendor, size, etc.);
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
#define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)


/*
  * Invalidate the IOTLB for a user-managed I/O page table
  *
  * Unlike what's defined in include/uapi/linux/iommu.h, this command 
  * doesn't allow the user to specify cache type and likely supports only
  * two granularities (all, or a specified range) in the I/O address space.
  *
  * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
  * cache). If the IOASID represents an I/O address space, the invalidation
  * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
  * represents a vPASID space, then this command applies to the PASID
  * cache.
  *
  * Similarly this command doesn't provide IOMMU-like granularity
  * info (domain-wide, pasid-wide, range-based), since it's all about the
  * I/O address space itself. The ioasid driver walks the attached
  * routing information to match the IOMMU semantics under the
  * hood. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- granularity
  * 
  * Return: 0 on success, -errno on failure
  */
#define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)


/*
  * Page fault report and response
  *
  * This is TBD. Can be added after other parts are cleared up. Likely it 
  * will be a ring buffer shared between user/kernel, an eventfd to notify 
  * the user and an ioctl to complete the fault.
  *
  * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
  */
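
Since this part is TBD, the following is only a rough sketch of the per-fault
record that could be placed in the shared ring buffer; all names and the
layout are placeholders:

	/* illustrative sketch only; names and layout are placeholders */
	struct ioasid_fault_event {
		u32	ioasid;
		u32	flags;		// read/write/exec, etc.
		u64	addr;		// faulting address in the I/O address space
		u32	cookie;		// echoed back when completing the fault
	};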


/*
  * Dirty page tracking 
  *
  * Track and report memory pages dirtied in I/O address spaces. There 
  * is ongoing work by Kunkun Jiang to extend the existing VFIO type1
  * implementation. It needs to be adapted to /dev/ioasid later.
  */


2.2. /dev/vfio uAPI
++++++++++++++++

/*
  * Bind a vfio_device to the specified IOASID fd
  *
  * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
  * vfio device should not be bound to multiple ioasid_fd's. 
  *
  * Input parameters:
  *	- ioasid_fd;
  *
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
#define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)


/*
  * Attach a vfio device to the specified IOASID
  *
  * Multiple vfio devices can be attached to the same IOASID, and vice 
  * versa. 
  *
  * User may optionally provide a "virtual PASID" to mark an I/O page 
  * table on this vfio device. Whether the virtual PASID is physically used 
  * or converted to another kernel-allocated PASID is a policy in vfio device 
  * driver.
  *
  * There is no need to specify ioasid_fd in this call due to the assumption 
  * of 1:1 connection between vfio device and the bound fd.
  *
  * Input parameter:
  *	- ioasid;
  *	- flag;
  *	- user_pasid (if specified);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
#define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)


2.3. /dev/kvm uAPI
++++++++++++

/*
  * Update CPU PASID mapping
  *
  * This is necessary when ENQCMD will be used in the guest while the
  * targeted device doesn't accept the vPASID saved in the CPU MSR.
  *
  * This command allows the user to set/clear the vPASID->pPASID mapping
  * in the CPU, by providing the IOASID (and FD) information representing
  * the I/O address space marked by this vPASID.
  *
  * Input parameters:
  *	- user_pasid;
  *	- ioasid_fd;
  *	- ioasid;
  */
#define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)


3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOASID_FD:

	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
	int ioasid_unregister_device(struct ioasid_dev *dev);

An ioasid_ctx is created for each fd:

	struct ioasid_ctx {
		// a list of allocated ioasid_data objects
		struct list_head		ioasid_list;
		// a list of registered devices
		struct list_head		dev_list;
		// a list of pre-registered virtual address ranges
		struct list_head		prereg_list;
	};

Each registered device is represented by ioasid_dev:

	struct ioasid_dev {
		struct list_head		next;
		struct ioasid_ctx	*ctx;
		// always be the physical device
		struct device 		*device;
		struct kref		kref;
	};

Because we assume one vfio_device is connected to at most one ioasid_fd, 
ioasid_dev could be embedded in vfio_device and then linked to 
ioasid_ctx->dev_list when registration succeeds. For mdev the struct
device should be the pointer to the parent device. The PASID marking this
mdev is specified later via VFIO_ATTACH_IOASID.

An ioasid_data is created upon IOASID_ALLOC, as the main object 
describing the characteristics of an I/O page table:

	struct ioasid_data {
		// link to ioasid_ctx->ioasid_list
		struct list_head		next;

		// the IOASID number
		u32			ioasid;

		// the handle to convey iommu operations
		// hold the pgd (TBD until discussing iommu api)
		struct iommu_domain *domain;

		// map metadata (vfio type1 semantics)
		struct rb_node		dma_list;

		// pointer to user-managed pgtable (for nesting case)
		u64			user_pgd;

		// link to the parent ioasid (for nesting)
		struct ioasid_data	*parent;

		// cache the global PASID shared by ENQCMD-capable
		// devices (see below explanation in section 4)
		u32			pasid;

		// a list of device attach data (routing information)
		struct list_head		attach_data;

		// a list of partially-attached devices (group)
		struct list_head		partial_devices;

		// a list of fault_data reported from the iommu layer
		struct list_head		fault_data;

		...
	}

ioasid_data and iommu_domain have overlapping roles as both are 
introduced to represent an I/O address space. It is still a big TBD how 
the two should be correlated or even merged, and whether new iommu 
ops are required to handle RID+PASID explicitly. We leave this open 
for now as this proposal is mainly about uAPI. For simplification 
purposes the two objects are kept separate in this context, assuming a 
1:1 connection between them and the domain as the placeholder 
representing the first-class object in the iommu ops. 

Two helper functions are provided to support VFIO_ATTACH_IOASID:

	struct attach_info {
		u32	ioasid;
		// If valid, the PASID to be used physically
		u32	pasid;
	};
	int ioasid_device_attach(struct ioasid_dev *dev, 
		struct attach_info info);
	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

The pasid parameter is optionally provided based on the policy in the vfio
device driver. It could be the PASID marking the default I/O address 
space for an mdev, the user-provided PASID marking a user I/O page
table, or another kernel-allocated PASID backing the user-provided one.
Please check the next section for a detailed explanation.
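
As a minimal sketch (assuming a pdev whose user-provided PASID is used
directly for physical routing; nothing here is a committed calling
convention), the vfio device driver path could look like:

	/* minimal sketch: pdev case, vPASID used directly as pPASID */
	struct attach_info info = {
		.ioasid	= ioasid,
		.pasid	= user_pasid,	// valid only if the user provided one
	};
	ret = ioasid_device_attach(dev, info);
	if (ret)
		return ret;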

A new object is introduced and linked to ioasid_data->attach_data for 
each successful attach operation:

	struct ioasid_attach_data {
		struct list_head		next;
		struct ioasid_dev	*dev;
		u32 			pasid;
	}

As explained in the design section, there is no explicit group enforcement
in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
implicit group check: until every device within an iommu group is 
attached to this IOASID, the already-attached devices in this group are
put in ioasid_data->partial_devices. The IOASID rejects any command as long
as the partial_devices list is not empty.

Finally, the last helper function:
	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
		u32 ioasid, bool alloc);

ioasid_get_global_pasid is necessary in scenarios where multiple devices 
want to share the same PASID value for the attached I/O page table (e.g. 
when ENQCMD is enabled, as explained in the next section). We need a 
centralized place (ioasid_data->pasid) to hold this value (allocated when
first called with alloc=true). The vfio device driver calls this function 
(alloc=true) to get the global PASID for an ioasid before calling 
ioasid_device_attach. KVM also calls this function (alloc=false) to set up 
the PASID translation structure when the user calls KVM_MAP_PASID.
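
For the ENQCMD-capable mdev case, a hedged sketch of how the two callers
might use this helper (the surrounding driver and KVM logic is assumed, not
defined here):

	/* vfio mdev driver, before attaching the ioasid (alloc=true) */
	pasid = ioasid_get_global_pasid(ctx, ioasid, true);
	info = (struct attach_info){ .ioasid = ioasid, .pasid = pasid };
	ioasid_device_attach(dev, info);

	/* KVM, when handling KVM_MAP_PASID (alloc=false) */
	ppasid = ioasid_get_global_pasid(ctx, ioasid, false);
	// program the vpasid->ppasid pair into the CPU PASID
	// translation structure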

4. PASID Virtualization
------------------------------

When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
created on the assigned vfio device. This leads to the concepts of 
"virtual PASID" (vPASID) vs. "physical PASID" (pPASID). A vPASID is assigned 
by the guest to mark a GVA address space while a pPASID is the one 
selected by the host and actually routed on the wire.

vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

The vfio device driver translates vPASID to pPASID before calling 
ioasid_device_attach, with two factors to be considered:

-    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
     should be instead converted to a newly-allocated one (vPASID!=
     pPASID);

-    If vPASID!=pPASID, whether pPASID is allocated from the per-RID PASID
     space or a global PASID space (implying the pPASID is shared across 
     devices, e.g. when supporting Intel ENQCMD which puts the PASID in a 
     CPU MSR as part of the process context);

The actual policy depends on pdev vs. mdev, and whether ENQCMD is
supported. There are three possible scenarios:

(Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
policies.)

1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID

     vPASIDs are directly programmed by the guest into the assigned MMIO 
     bar, implying all DMAs out of this device carry a vPASID in the packet 
     header. This mandates vPASID==pPASID, sort of delegating the entire 
     per-RID PASID space to the guest.

     When ENQCMD is enabled, the CPU MSR when running a guest task
     contains a vPASID. In this case the CPU PASID translation capability 
     should be disabled so this vPASID in the CPU MSR is directly sent on the
     wire.

     This ensures consistent vPASID usage on pdev regardless of whether the 
     workload is submitted through an MMIO register or the ENQCMD instruction.

2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)

     PASIDs are also used by the kernel to mark the default I/O address space 
     for an mdev, thus they cannot be delegated to the guest. Instead, the mdev 
     driver must allocate a new pPASID for each vPASID (thus vPASID!=
     pPASID) and then use the pPASID when attaching this mdev to an ioasid.

     The mdev driver needs to cache the PASID mapping so that in the mediation 
     path a vPASID programmed by the guest can be converted to a pPASID 
     before updating the physical MMIO register. The mapping should
     also be saved in the CPU PASID translation structure (via the KVM uAPI), 
     so the vPASID saved in the CPU MSR is auto-translated to a pPASID 
     before being sent on the wire when ENQCMD is enabled. 

     Generally the pPASID could be allocated from the per-RID PASID space
     if all mdevs created on the parent device don't support ENQCMD.

     However if the parent supports ENQCMD-capable mdevs, pPASIDs
     must be allocated from a global pool because the CPU PASID 
     translation structure is per-VM. It implies that when a guest I/O 
     page table is attached to two mdevs with a single vPASID (i.e. bound 
     to the same guest process), the same pPASID should be used for 
     both mdevs even when they belong to different parents. Sharing a
     pPASID across mdevs is achieved by calling the aforementioned 
     ioasid_get_global_pasid().

3)  Mixing pdev/mdev together

     The above policies are per device type and thus are not affected when 
     mixing those device types together (when assigned to a single guest). 
     However, there is one exception - when both pdev and mdev support ENQCMD.

     Remember the two types have conflicting requirements on whether 
     CPU PASID translation should be enabled. This capability is per-VM, 
     and must be enabled for mdev isolation. When enabled, a pdev will 
     receive an mdev pPASID, violating its vPASID expectation.

     In a previous thread a PASID range split scheme was discussed to support
     this combination, but we haven't worked out a clean uAPI design yet.
     Therefore in this proposal we decide not to support it, implying the 
     user should have the intelligence to avoid such a scenario. It could be
     a TODO task for the future.

In spite of those subtle considerations, the kernel implementation could
start simple, e.g.:

-    v==p for pdev;
-    v!=p and always use a global PASID pool for all mdevs;

Regardless of the kernel policy, the user policy is unchanged:

-    provide a vPASID when calling VFIO_ATTACH_IOASID;
-    call the KVM uAPI to set up CPU PASID translation for an ENQCMD-capable mdev;
-    don't expose the ENQCMD capability on both pdev and mdev;

A sample user flow is described in section 5.5.

5. Use Cases and Flows
-------------------------------

Here we assume VFIO will support a new model where every bound device
is explicitly listed under /dev/vfio, thus a device fd can be acquired w/o 
going through the legacy container/group interface. For illustration purposes
those devices are just called dev[1...N]:

	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

As explained earlier, one IOASID fd is sufficient for all intended use cases:

	ioasid_fd = open("/dev/ioasid", mode);

For simplicity the examples below are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.

Three types of IOASIDs are considered:

	gpa_ioasid[1...N]: 	for GPA address space
	giova_ioasid[1...N]:	for guest IOVA address space
	gva_ioasid[1...N]:	for guest CPU VA address space

At least one gpa_ioasid must always be created per guest, while the other 
two are relevant as far as vIOMMU is concerned.

The examples here apply to both pdev and mdev, unless explicitly marked out
(e.g. in section 5.5). The VFIO device driver in the kernel will figure out the 
associated routing information in the attach operation.

For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
INFO are skipped in these examples.

5.1. A simple example
++++++++++++++++++

Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
space is managed through DMA mapping protocol:

	/* Bind device to IOASID fd */
	device_fd = open("/dev/vfio/devices/dev1", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* Attach device to IOASID */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

If the guest is assigned more than dev1, the user follows the above sequence
to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA 
address space across all assigned devices.
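
For example, a second device dev2 could be added to the same GPA space with
the same steps (a sketch reusing the handles above; dev2 is assumed to be
another assigned device):

	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);

	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);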

5.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates 
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
through mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today).

To avoid duplicated locked page accounting, it's recommended to pre-
register the virtual address range that will be used for DMA:

	device_fd1 = open("/dev/vfio/devices/dev1", mode);
	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* pre-register the virtual address range for accounting */
	mem_info = { .vaddr = 0x40000000; .size = 1GB };
	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);

	/* Attach dev1 and dev2 to gpa_ioasid */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0; 		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

	/* After boot, the guest enables a GIOVA space for dev2 */
	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	/* First detach dev2 from previous address space */
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

	/* Then attach dev2 to the new address space */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a shadow DMA mapping according to vIOMMU
	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000; 	// GIOVA
		.vaddr	= 0x40001000;	// HVA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.3. IOASID nesting (software)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with software-based IOASID nesting 
available. In this mode it is the kernel instead of the user that creates the
shadow mapping.

The flow before the guest boots is the same as in 5.2, except for one point.
Because giova_ioasid is nested on gpa_ioasid, locked page accounting is only 
conducted for gpa_ioasid, so it's not necessary to pre-register virtual 
memory.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

	/* After boot */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
	  * to form a shadow mapping.
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x1000;	// GPA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.4. IOASID nesting (hardware)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with hardware-based IOASID nesting
available. In this mode the pgtable binding protocol is used to 
bind the guest IOVA page table with the IOMMU:

	/* After boot */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= giova_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular information
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

	/* See 5.6 for I/O page fault handling */
	
5.5. Guest SVA (vSVA)
++++++++++++++++++

After boot the guest further creates a GVA address space (gpasid1) on 
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, the user should avoid exposing ENQCMD on both
pdev and mdev.

The sequence applies to all device types (pdev or mdev), except for
one additional step calling KVM for an ENQCMD-capable mdev:

	/* After boot */
	/* Make GVA space nested on GPA space */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space and specify vPASID */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
	  * translation structure through KVM
	  */
	pa_data = {
		.ioasid_fd	= ioasid_fd;
		.ioasid		= gva_ioasid;
		.guest_pasid	= gpasid1;
	};
	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...


5.6. I/O page fault
+++++++++++++++

(The uAPI is TBD. Here is just the high-level flow from the host IOMMU driver
to the guest IOMMU driver and back.)

-   The host IOMMU driver receives a page request with raw fault_data {rid, 
    pasid, addr};

-   The host IOMMU driver identifies the faulting I/O page table according to
    the information registered by the IOASID fault handler;

-   The IOASID fault handler is called with the raw fault_data (rid, pasid, 
    addr), which is saved in ioasid_data->fault_data (used for the response);

-   The IOASID fault handler generates a user fault_data (ioasid, addr), links 
    it to the shared ring buffer and triggers the eventfd to userspace;

-   Upon receiving the event, Qemu needs to find the virtual routing information 
    (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
    multiple, pick a random one. This should be fine since the purpose is to
    fix the I/O page table in the guest;

-   Qemu generates a virtual I/O page fault through the vIOMMU into the guest,
    carrying the virtual fault data (v_rid, v_pasid, addr);

-   The guest IOMMU driver fixes up the fault, updates the I/O page table, and
    then sends a page response with virtual completion data (v_rid, v_pasid, 
    response_code) to the vIOMMU;

-   Qemu finds the pending fault event, converts the virtual completion data 
    into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
    complete the pending fault (see the sketch after this list);

-   /dev/ioasid finds the pending fault data {rid, pasid, addr} saved in 
    ioasid_data->fault_data, and then calls the iommu API to complete it with
    {rid, pasid, response_code};
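
Since the completion ioctl is not defined yet, below is only a hypothetical
sketch of the Qemu-side completion call (the ioctl name and fields are
placeholders):

	/* hypothetical completion ioctl; name and layout are placeholders */
	fault_done = {
		.ioasid		= ioasid;
		.response_code	= response_code;
	};
	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &fault_done);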

5.7. BIND_PASID_TABLE
++++++++++++++++++++

The PASID table is put in the GPA space on some platforms, thus it must be
updated by the guest. It is treated as another user page table to be bound 
to the IOMMU.

As explained earlier, the user still needs to explicitly bind every user I/O 
page table to the kernel so the same pgtable binding protocol (bind, cache 
invalidate and fault handling) is unified across platforms.

vIOMMUs may include a caching mode (or a paravirtualized mechanism) which, 
once enabled, requires the guest to invalidate the PASID cache for any change 
to the PASID table. This allows Qemu to track the lifespan of guest I/O page 
tables.

If such a capability is missing, Qemu could enable write-protection on
the guest PASID table to achieve the same effect.

	/* After boot */
	/* Make vPASID space nested on GPA space */
	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to pasidtbl_ioasid */
	at_data = { .ioasid = pasidtbl_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind PASID table */
	bind_data = {
		.ioasid	= pasidtbl_ioasid;
		.addr	= gpa_pasid_table;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);

	/* vIOMMU detects a new GVA I/O space created */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space, with gpasid1 */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
	  * used, the kernel will not update the PASID table. Instead, it just
	  * tracks the bound I/O page table for handling invalidation and
	  * I/O page faults.
	  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* [RFC] /dev/ioasid uAPI proposal
@ 2021-05-27  7:58 ` Tian, Kevin
  0 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-05-27  7:58 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj, Ashok,
	Jonathan Corbet, Kirti Wankhede, David Gibson, Robin Murphy, Wu,
	Hao

/dev/ioasid provides an unified interface for managing I/O page tables for 
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
etc.) are expected to use this interface instead of creating their own logic to 
isolate untrusted device DMAs initiated by userspace. 

This proposal describes the uAPI of /dev/ioasid and also sample sequences 
with VFIO as example in typical usages. The driver-facing kernel API provided 
by the iommu layer is still TBD, which can be discussed after consensus is 
made on this uAPI.

It's based on a lengthy discussion starting from here:
	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 

It ends up to be a long writing due to many things to be summarized and
non-trivial effort required to connect them into a complete proposal.
Hope it provides a clean base to converge.

TOC
====
1. Terminologies and Concepts
2. uAPI Proposal
    2.1. /dev/ioasid uAPI
    2.2. /dev/vfio uAPI
    2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
    5.1. A simple example
    5.2. Multiple IOASIDs (no nesting)
    5.3. IOASID nesting (software)
    5.4. IOASID nesting (hardware)
    5.5. Guest SVA (vSVA)
    5.6. I/O page fault
    5.7. BIND_PASID_TABLE
====

1. Terminologies and Concepts
-----------------------------------------

IOASID FD is the container holding multiple I/O address spaces. User 
manages those address spaces through FD operations. Multiple FD's are 
allowed per process, but with this proposal one FD should be sufficient for 
all intended usages.

IOASID is the FD-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying the output address from one I/O page table 
(represented by child IOASID) must be further translated by another I/O 
page table (represented by parent IOASID).

I/O address space can be managed through two protocols, according to 
whether the corresponding I/O page table is constructed by the kernel or 
the user. When kernel-managed, a dma mapping protocol (similar to 
existing VFIO iommu type1) is provided for the user to explicitly specify 
how the I/O address space is mapped. Otherwise, a different protocol is 
provided for the user to bind an user-managed I/O page table to the 
IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
handling. 

Pgtable binding protocol can be used only on the child IOASID's, implying 
IOASID nesting must be enabled. This is because the kernel doesn't trust 
userspace. Nesting allows the kernel to enforce its DMA isolation policy 
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and 
software nesting. With hardware support the child and parent I/O page 
tables are walked consecutively by the IOMMU to form a nested translation. 
When it's implemented in software, the ioasid driver is responsible for 
merging the two-level mappings into a single-level shadow I/O page table. 
Software nesting requires both child/parent page tables operated through 
the dma mapping protocol, so any change in either level can be captured 
by the kernel to update the corresponding shadow mapping.

An I/O address space takes effect in the IOMMU only after it is attached 
to a device. The device in the /dev/ioasid context always refers to a 
physical one or 'pdev' (PF or VF). 

One I/O address space could be attached to multiple devices. In this case, 
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed 
to attach to multiple I/O address spaces, with DMAs accessing them by 
carrying different routing information. One of them is the default I/O 
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
remaining are routed by RID + Process Address Space ID (PASID) or 
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through passthrough framework uAPI (use
VFIO for simplicity in following context). VFIO is responsible for identifying 
the routing information and registering it to the ioasid driver when calling 
ioasid attach helper function. It could be RID if the assigned device is 
pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
user might also provide its view of virtual routing information (vPASID) in 
the attach call, e.g. when multiple user-managed I/O address spaces are 
attached to the vfio_device. In this case VFIO must figure out whether 
vPASID should be directly used (for pdev) or converted to a kernel-
allocated one (pPASID, for mdev) for physical routing (see section 4).

Device must be bound to an IOASID FD before attach operation can be
conducted. This is also through VFIO uAPI. In this proposal one device 
should not be bound to multiple FD's. Not sure about the gain of 
allowing it except adding unnecessary complexity. But if others have 
different view we can further discuss.

VFIO must ensure its device composes DMAs with the routing information
attached to the IOASID. For pdev it naturally happens since vPASID is 
directly programmed to the device by guest software. For mdev this 
implies any guest operation carrying a vPASID on this device must be 
trapped into VFIO and then converted to pPASID before sent to the 
device. A detail explanation about PASID virtualization policies can be 
found in section 4. 

Modern devices may support a scalable workload submission interface 
based on PCI DMWr capability, allowing a single work queue to access
multiple I/O address spaces. One example is Intel ENQCMD, having 
PASID saved in the CPU MSR and carried in the instruction payload 
when sent out to the device. Then a single work queue shared by 
multiple processes can compose DMAs carrying different PASIDs. 

When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
which, if targeting a mdev, must be converted to pPASID before sent
to the wire. Intel CPU provides a hardware PASID translation capability 
for auto-conversion in the fast path. The user is expected to setup the 
PASID mapping through KVM uAPI, with information about {vpasid, 
ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
to figure out the actual pPASID given an IOASID.

With above design /dev/ioasid uAPI is all about I/O address spaces. 
It doesn't include any device routing information, which is only 
indirectly registered to the ioasid driver through VFIO uAPI. For 
example, I/O page fault is always reported to userspace per IOASID, 
although it's physically reported per device (RID+PASID). If there is a 
need of further relaying this fault into the guest, the user is responsible 
of identifying the device attached to this IOASID (randomly pick one if 
multiple attached devices) and then generates a per-device virtual I/O 
page fault into guest. Similarly the iotlb invalidation uAPI describes the 
granularity in the I/O address space (all, or a range), different from the 
underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

I/O page tables routed through PASID are installed in a per-RID PASID 
table structure. Some platforms implement the PASID table in the guest 
physical space (GPA), expecting it managed by the guest. The guest
PASID table is bound to the IOMMU also by attaching to an IOASID, 
representing the per-RID vPASID space. 

We propose the host kernel needs to explicitly track  guest I/O page 
tables even on these platforms, i.e. the same pgtable binding protocol 
should be used universally on all platforms (with only difference on who 
actually writes the PASID table). One opinion from previous discussion 
was treating this special IOASID as a container for all guest I/O page 
tables i.e. hiding them from the host. However this way significantly 
violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
one address space any more. Device routing information (indirectly 
marking hidden I/O spaces) has to be carried in iotlb invalidation and 
page faulting uAPI to help connect vIOMMU with the underlying 
pIOMMU. This is one design choice to be confirmed with ARM guys.

Devices may sit behind IOMMU's with incompatible capabilities. The
difference may lie in the I/O page table format, or availability of an user
visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
checking the incompatibility between newly-attached device and existing
devices under the specific IOASID and, if found, returning error to user.
Upon such error the user should create a new IOASID for the incompatible
device. 

There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
device notation in this interface as aforementioned. But the ioasid driver 
does implicit check to make sure that devices within an iommu group 
must be all attached to the same IOASID before this IOASID starts to
accept any uAPI command. Otherwise error information is returned to 
the user.

There was a long debate in previous discussion whether VFIO should keep 
explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
a simplified model where every device bound to VFIO is explicitly listed 
under /dev/vfio thus a device fd can be acquired w/o going through legacy
container/group interface. In this case the user is responsible for 
understanding the group topology and meeting the implicit group check 
criteria enforced in /dev/ioasid. The use case examples in this proposal 
are based on the new model.

Of course for backward compatibility VFIO still needs to keep the existing 
uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
iommu ops to internal ioasid helper functions.

Notes:
-   It might be confusing as IOASID is also used in the kernel (drivers/
    iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
    find a better name later to differentiate.

-   PPC has not be considered yet as we haven't got time to fully understand
    its semantics. According to previous discussion there is some generality 
    between PPC window-based scheme and VFIO type1 semantics. Let's 
    first make consensus on this proposal and then further discuss how to 
    extend it to cover PPC's requirement.

-   There is a protocol between vfio group and kvm. Needs to think about
    how it will be affected following this proposal.

-   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
    which can be physically isolated in-between through PASID-granular
    IOMMU protection. Historically people also discussed one usage by 
    mediating a pdev into a mdev. This usage is not covered here, and is 
    supposed to be replaced by Max's work which allows overriding various 
    VFIO operations in vfio-pci driver.

2. uAPI Proposal
----------------------

/dev/ioasid uAPI covers everything about managing I/O address spaces.

/dev/vfio uAPI builds connection between devices and I/O address spaces.

/dev/kvm uAPI is optional required as far as ENQCMD is concerned.


2.1. /dev/ioasid uAPI
+++++++++++++++++

/*
  * Check whether an uAPI extension is supported. 
  *
  * This is for FD-level capabilities, such as locked page pre-registration. 
  * IOASID-level capabilities are reported through IOASID_GET_INFO.
  *
  * Return: 0 if not supported, 1 if supported.
  */
#define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)


/*
  * Register user space memory where DMA is allowed.
  *
  * It pins user pages and does the locked memory accounting so sub-
  * sequent IOASID_MAP/UNMAP_DMA calls get faster.
  *
  * When this ioctl is not used, one user page might be accounted
  * multiple times when it is mapped by multiple IOASIDs which are
  * not nested together.
  *
  * Input parameters:
  *	- vaddr;
  *	- size;
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
#define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)


/*
  * Allocate an IOASID. 
  *
  * IOASID is the FD-local software handle representing an I/O address 
  * space. Each IOASID is associated with a single I/O page table. User 
  * must call this ioctl to get an IOASID for every I/O address space that is
  * intended to be enabled in the IOMMU.
  *
  * A newly-created IOASID doesn't accept any command before it is 
  * attached to a device. Once attached, an empty I/O page table is 
  * bound with the IOMMU then the user could use either DMA mapping 
  * or pgtable binding commands to manage this I/O page table.
  *
  * Device attachment is initiated through device driver uAPI (e.g. VFIO)
  *
  * Return: allocated ioasid on success, -errno on failure.
  */
#define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
#define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)


/*
  * Get information about an I/O address space
  *
  * Supported capabilities:
  *	- VFIO type1 map/unmap;
  *	- pgtable/pasid_table binding
  *	- hardware nesting vs. software nesting;
  *	- ...
  *
  * Related attributes:
  * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
  *	- vendor pgtable formats (pgtable binding);
  *	- number of child IOASIDs (nesting);
  *	- ...
  *
  * Above information is available only after one or more devices are
  * attached to the specified IOASID. Otherwise the IOASID is just a
  * number w/o any capability or attribute.
  *
  * Input parameters:
  *	- u32 ioasid;
  *
  * Output parameters:
  *	- many. TBD.
  */
#define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)


/*
  * Map/unmap process virtual addresses to I/O virtual addresses.
  *
  * Provide VFIO type1 equivalent semantics. Start with the same 
  * restriction e.g. the unmap size should match those used in the 
  * original mapping call. 
  *
  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
  * must be already in the preregistered list.
  *
  * Input parameters:
  *	- u32 ioasid;
  *	- refer to vfio_iommu_type1_dma_{un}map
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
#define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)


/*
  * Create a nesting IOASID (child) on an existing IOASID (parent)
  *
  * IOASIDs can be nested together, implying that the output address 
  * from one I/O page table (child) must be further translated by 
  * another I/O page table (parent).
  *
  * As the child adds essentially another reference to the I/O page table 
  * represented by the parent, any device attached to the child ioasid 
  * must be already attached to the parent.
  *
  * In concept there is no limit on the number of nesting levels. 
  * However, for the majority of cases one nesting level is sufficient. The
  * user should check whether an IOASID supports nesting through 
  * IOASID_GET_INFO. For example, if only one nesting level is allowed,
  * the nesting capability is reported only on the parent instead of the
  * child.
  *
  * The user also needs to check (via IOASID_GET_INFO) whether the nesting 
  * is implemented in hardware or software. If software-based, the DMA 
  * mapping protocol should be used on the child IOASID. Otherwise, 
  * the child should be operated with the pgtable binding protocol.
  *
  * Input parameters:
  *	- u32 parent_ioasid;
  *
  * Return: child_ioasid on success, -errno on failure;
  */
#define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
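
For example, a minimal sketch of selecting the protocol for a newly-created
child IOASID; since the IOASID_GET_INFO output format is TBD, the info
structure and flag name below are hypothetical placeholders:

	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, gpa_ioasid);

	/* hypothetical: query how nesting is implemented for the child */
	info = { .ioasid = giova_ioasid };
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	if (info.flags & IOASID_INFO_HW_NESTING)
		/* hardware nesting: bind the guest I/O page table (see 5.4) */
		ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
	else
		/* software nesting: use DMA mapping on the child (see 5.3) */
		ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);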


/*
  * Bind a user-managed I/O page table to the IOMMU
  *
  * Because the user page table is untrusted, IOASID nesting must be enabled 
  * for this ioasid so the kernel can enforce its DMA isolation policy 
  * through the parent ioasid.
  *
  * The pgtable binding protocol is different from DMA mapping. The latter 
  * has the I/O page table constructed by the kernel and updated 
  * according to user MAP/UNMAP commands. With pgtable binding the 
  * whole page table is created and updated by userspace, thus a different 
  * set of commands is required (bind, iotlb invalidation, page fault, etc.).
  *
  * Because the page table is directly walked by the IOMMU, the user 
  * must use a format compatible with the underlying hardware. It can 
  * check the format information through IOASID_GET_INFO.
  *
  * The page table is bound to the IOMMU according to the routing 
  * information of each attached device under the specified IOASID. The
  * routing information (RID and optional PASID) is registered when a 
  * device is attached to this IOASID through VFIO uAPI. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of the user page table;
  *	- formats (vendor, address_width, etc.);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
#define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)


/*
  * Bind a user-managed PASID table to the IOMMU
  *
  * This is required for platforms which place the PASID table in the GPA space.
  * In this case the specified IOASID represents the per-RID PASID space.
  *
  * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
  * special flag to indicate the difference from normal I/O address spaces.
  *
  * The format info of the PASID table is reported in IOASID_GET_INFO.
  *
  * As explained in the design section, user-managed I/O page tables must
  * be explicitly bound to the kernel even on these platforms. This allows
  * the kernel to uniformly manage I/O address spaces across all platforms.
  * Otherwise, the iotlb invalidation and page faulting uAPIs must be hacked
  * to carry device routing information to indirectly mark the hidden I/O
  * address spaces.
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of PASID table;
  *	- formats (vendor, size, etc.);
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
#define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)


/*
  * Invalidate the IOTLB for a user-managed I/O page table
  *
  * Unlike what's defined in include/uapi/linux/iommu.h, this command 
  * doesn't allow the user to specify the cache type and likely supports only
  * two granularities (all, or a specified range) in the I/O address space.
  *
  * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
  * cache). If the IOASID represents an I/O address space, the invalidation
  * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
  * represents a vPASID space, then this command applies to the PASID
  * cache.
  *
  * Similarly this command doesn't provide IOMMU-like granularity
  * info (domain-wide, pasid-wide, range-based), since it's all about the
  * I/O address space itself. The ioasid driver walks the attached
  * routing information to match the IOMMU semantics under the
  * hood. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- granularity
  * 
  * Return: 0 on success, -errno on failure
  */
#define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)


/*
  * Page fault report and response
  *
  * This is TBD. Can be added after other parts are cleared up. Likely it 
  * will be a ring buffer shared between user/kernel, an eventfd to notify 
  * the user and an ioctl to complete the fault.
  *
  * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
  */


/*
  * Dirty page tracking 
  *
  * Track and report memory pages dirtied in I/O address spaces. There 
  * is an ongoing work by Kunkun Jiang by extending existing VFIO type1. 
  * It needs be adapted to /dev/ioasid later.
  */


2.2. /dev/vfio uAPI
++++++++++++++++

/*
  * Bind a vfio_device to the specified IOASID fd
  *
  * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
  * vfio device should not be bound to multiple ioasid_fd's. 
  *
  * Input parameters:
  *	- ioasid_fd;
  *
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
#define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)


/*
  * Attach a vfio device to the specified IOASID
  *
  * Multiple vfio devices can be attached to the same IOASID, and vice 
  * versa. 
  *
  * The user may optionally provide a "virtual PASID" to mark an I/O page 
  * table on this vfio device. Whether the virtual PASID is physically used 
  * or converted to another kernel-allocated PASID is a policy decision in 
  * the vfio device driver.
  *
  * There is no need to specify ioasid_fd in this call due to the assumption 
  * of a 1:1 connection between the vfio device and the bound fd.
  *
  * Input parameter:
  *	- ioasid;
  *	- flag;
  *	- user_pasid (if specified);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
#define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)


2.3. /dev/kvm uAPI
++++++++++++

/*
  * Update CPU PASID mapping
  *
  * This is necessary when ENQCMD will be used in the guest while the
  * targeted device doesn't accept the vPASID saved in the CPU MSR.
  *
  * This command allows the user to set/clear the vPASID->pPASID mapping
  * in the CPU, by providing the IOASID (and FD) information representing
  * the I/O address space marked by this vPASID.
  *
  * Input parameters:
  *	- user_pasid;
  *	- ioasid_fd;
  *	- ioasid;
  */
#define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)


3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOASID_FD:

	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
	int ioasid_unregister_device(struct ioasid_dev *dev);

An ioasid_ctx is created for each fd:

	struct ioasid_ctx {
		// a list of allocated IOASID data's
		struct list_head		ioasid_list;
		// a list of registered devices
		struct list_head		dev_list;
		// a list of pre-registered virtual address ranges
		struct list_head		prereg_list;
	};

Each registered device is represented by ioasid_dev:

	struct ioasid_dev {
		struct list_head		next;
		struct ioasid_ctx	*ctx;
		// always be the physical device
		struct device 		*device;
		struct kref		kref;
	};

Because we assume one vfio_device is connected to at most one ioasid_fd, 
ioasid_dev could be embedded in vfio_device and then linked to 
ioasid_ctx->dev_list when registration succeeds. For mdev the struct
device should be the pointer to the parent device. The PASID marking this
mdev is specified later at VFIO_ATTACH_IOASID time.
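
For illustration, here is a hedged sketch of how VFIO could wire
VFIO_BIND_IOASID_FD to the helpers above. The vfio_device layout assumed
below (an embedded ioasid_dev 'idev' and a physical 'dev' pointer) is
hypothetical:

	static int vfio_device_bind_ioasid_fd(struct vfio_device *vdev, int fd)
	{
		struct ioasid_ctx *ctx = ioasid_ctx_fdget(fd);

		if (!ctx)
			return -EBADF;

		vdev->idev.ctx = ctx;
		vdev->idev.device = vdev->dev;	// always the physical device
		return ioasid_register_device(ctx, &vdev->idev);
	}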

An ioasid_data is created at IOASID_ALLOC time, as the main object 
describing the characteristics of an I/O page table:

	struct ioasid_data {
		// link to ioasid_ctx->ioasid_list
		struct list_head		next;

		// the IOASID number
		u32			ioasid;

		// the handle to convey iommu operations
		// hold the pgd (TBD until discussing iommu api)
		struct iommu_domain *domain;

		// map metadata (vfio type1 semantics)
		struct rb_node		dma_list;

		// pointer to user-managed pgtable (for nesting case)
		u64			user_pgd;

		// link to the parent ioasid (for nesting)
		struct ioasid_data	*parent;

		// cache the global PASID shared by ENQCMD-capable
		// devices (see below explanation in section 4)
		u32			pasid;

		// a list of device attach data (routing information)
		struct list_head		attach_data;

		// a list of partially-attached devices (group)
		struct list_head		partial_devices;

		// a list of fault_data reported from the iommu layer
		struct list_head		fault_data;

		...
	}

ioasid_data and iommu_domain have overlapping roles as both are 
introduced to represent an I/O address space. It is still a big TBD how 
the two should be correlated or even merged, and whether new iommu 
ops are required to handle RID+PASID explicitly. We leave this open 
for now as this proposal is mainly about uAPI. For simplification 
purposes the two objects are kept separate in this context, assuming a 
1:1 connection in-between with the domain as the place-holder 
representing the 1st class object in the iommu ops. 

Two helper functions are provided to support VFIO_ATTACH_IOASID:

	struct attach_info {
		u32	ioasid;
		// If valid, the PASID to be used physically
		u32	pasid;
	};
	int ioasid_device_attach(struct ioasid_dev *dev, 
		struct attach_info info);
	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

The pasid parameter is optionally provided based on the policy in the vfio
device driver. It could be the PASID marking the default I/O address 
space for a mdev, or the user-provided PASID marking a user I/O page
table, or another kernel-allocated PASID backing the user-provided one.
Please check the next section for a detailed explanation.

A new object is introduced and linked to ioasid_data->attach_data for 
each successful attach operation:

	struct ioasid_attach_data {
		struct list_head		next;
		struct ioasid_dev	*dev;
		u32 			pasid;
	}

As explained in the design section, there is no explicit group enforcement
in /dev/ioasid uAPI or helper functions. But the ioasid driver does an
implicit group check - until every device within an iommu group is 
attached to this IOASID, the already-attached devices in this group are
put in ioasid_data->partial_devices. The IOASID rejects any command while
the partial_devices list is not empty.

The last helper function is:
	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
		u32 ioasid, bool alloc);

ioasid_get_global_pasid is necessary in scenarios where multiple devices 
want to share the same PASID value on the attached I/O page table (e.g. 
when ENQCMD is enabled, as explained in the next section). We need a 
centralized place (ioasid_data->pasid) to hold this value (allocated when
first called with alloc=true). The vfio device driver calls this function (alloc=
true) to get the global PASID for an ioasid before calling ioasid_device_
attach. KVM also calls this function (alloc=false) to set up the PASID translation 
structure when the user calls KVM_MAP_PASID.
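
For illustration, a hedged sketch of how an ENQCMD-capable mdev driver
might combine these helpers when handling VFIO_ATTACH_IOASID; error
handling is omitted and the variable names are only illustrative:

	u32 ppasid = ioasid_get_global_pasid(ctx, ioasid, true /* alloc */);
	struct attach_info info = {
		.ioasid	= ioasid,
		.pasid	= ppasid,	// the pPASID actually used on the wire
	};
	ret = ioasid_device_attach(idev, info);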

4. PASID Virtualization
------------------------------

When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
created on the assigned vfio device. This leads to the concepts of 
"virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned 
by the guest to mark a GVA address space while pPASID is the one 
selected by the host and actually routed on the wire.

vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

The vfio device driver translates vPASID to pPASID before calling 
ioasid_device_attach, with two factors to be considered:

-    Whether vPASID is directly used (vPASID==pPASID) on the wire, or 
     should instead be converted to a newly-allocated one (vPASID!=
     pPASID);

-    If vPASID!=pPASID, whether pPASID is allocated from the per-RID PASID
     space or a global PASID space (implying sharing pPASID across devices,
     e.g. when supporting Intel ENQCMD which puts the PASID in a CPU MSR
     as part of the process context);

The actual policy depends on pdev vs. mdev, and whether ENQCMD is
supported. There are three possible scenarios:

(Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
policies.)

1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID

     vPASIDs are directly programmed by the guest to the assigned MMIO 
     bar, implying all DMAs out of this device carry a vPASID in the packet 
     header. This mandates vPASID==pPASID, sort of delegating the entire 
     per-RID PASID space to the guest.

     When ENQCMD is enabled, the CPU MSR when running a guest task
     contains a vPASID. In this case the CPU PASID translation capability 
     should be disabled so this vPASID in CPU MSR is directly sent to the
     wire.

     This ensures consistent vPASID usage on pdev regardless of whether the 
     workload is submitted through a MMIO register or the ENQCMD instruction.

2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)

     PASIDs are also used by the kernel to mark the default I/O address space 
     for mdev, thus cannot be delegated to the guest. Instead, the mdev 
     driver must allocate a new pPASID for each vPASID (thus vPASID!=
     pPASID) and then use pPASID when attaching this mdev to an ioasid.

     The mdev driver needs to cache the PASID mapping so that in the mediation 
     path the vPASID programmed by the guest can be converted to pPASID 
     before updating the physical MMIO register (a sketch of this conversion 
     follows at the end of this section). The mapping should
     also be saved in the CPU PASID translation structure (via KVM uAPI), 
     so the vPASID saved in the CPU MSR is auto-translated to pPASID 
     before being sent to the wire when ENQCMD is enabled.

     Generally pPASID could be allocated from the per-RID PASID space
     if all mdev's created on the parent device don't support ENQCMD.

     However if the parent supports ENQCMD-capable mdevs, pPASIDs
     must be allocated from a global pool because the CPU PASID 
     translation structure is per-VM. This implies that when a guest I/O 
     page table is attached to two mdevs with a single vPASID (i.e. bound 
     to the same guest process), the same pPASID should be used for 
     both mdevs even when they belong to different parents. Sharing
     pPASID across mdevs is achieved by calling the aforementioned ioasid_
     get_global_pasid().

3)  Mix pdev/mdev together

     The above policies are per device type and thus are not affected when 
     mixing those device types together (when assigned to a single guest). 
     However, there is one exception - when both pdev and mdev support ENQCMD.

     Remember the two types have conflicting requirements on whether 
     CPU PASID translation should be enabled. This capability is per-VM, 
     and must be enabled for mdev isolation. When enabled, a pdev will 
     receive a mdev pPASID, violating its vPASID expectation.

     In a previous thread a PASID range split scheme was discussed to support
     this combination, but we haven't worked out a clean uAPI design yet.
     Therefore in this proposal we decide not to support it, implying the 
     user should be smart enough to avoid such a scenario. It could be
     a TODO task for the future.

In spite of those subtle considerations, the kernel implementation could
start simple, e.g.:

-    v==p for pdev;
-    v!=p and always use a global PASID pool for all mdev's;

Regardless of the kernel policy, the user policy is unchanged:

-    provide vPASID when calling VFIO_ATTACH_IOASID;
-    call KVM uAPI to set up CPU PASID translation for an ENQCMD-capable mdev;
-    don't expose ENQCMD capability on both pdev and mdev;

Sample user flow is described in section 5.5.
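
As referenced in scenario 2 above, below is a hedged sketch of the
mediation-path conversion in a mdev driver when the guest programs a vPASID
through a trapped MMIO write; the lookup helper and register offset are
hypothetical placeholders:

	u32 ppasid = mdev_vpasid_to_ppasid(mdev, vpasid);	// cached vPASID->pPASID mapping
	writel(ppasid, mdev->mmio_base + WQ_PASID_REG);		// update the physical register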

5. Use Cases and Flows
-------------------------------

Here we assume VFIO will support a new model where every bound device
is explicitly listed under /dev/vfio, thus a device fd can be acquired w/o 
going through the legacy container/group interface. For illustration purposes
those devices are just called dev[1...N]:

	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

As explained earlier, one IOASID fd is sufficient for all intended use cases:

	ioasid_fd = open("/dev/ioasid", mode);

For simplicity the examples below are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.

Three types of IOASIDs are considered:

	gpa_ioasid[1...N]: 	for GPA address space
	giova_ioasid[1...N]:	for guest IOVA address space
	gva_ioasid[1...N]:	for guest CPU VA address space

At least one gpa_ioasid must always be created per guest, while the other 
two are relevant as far as vIOMMU is concerned.

Examples here apply to both pdev and mdev, unless explicitly noted otherwise
(e.g. in section 5.5). The VFIO device driver in the kernel will figure out the 
associated routing information during the attach operation.

For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
INFO are skipped in these examples.

5.1. A simple example
++++++++++++++++++

Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
space is managed through DMA mapping protocol:

	/* Bind device to IOASID fd */
	device_fd = open("/dev/vfio/devices/dev1", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* Attach device to IOASID */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

If the guest is assigned more than dev1, the user follows the above sequence
to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA 
address space across all assigned devices.

5.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates 
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in 
passthrough mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today).

To avoid duplicated locked page accounting, it's recommended to pre-
register the virtual address range that will be used for DMA:

	device_fd1 = open("/dev/vfio/devices/dev1", mode);
	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* pre-register the virtual address range for accounting */
	mem_info = { .vaddr = 0x40000000; .size = 1GB };
	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);

	/* Attach dev1 and dev2 to gpa_ioasid */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0; 		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

	/* After boot, the guest enables a GIOVA space for dev2 */
	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	/* First detach dev2 from previous address space */
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

	/* Then attach dev2 to the new address space */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a shadow DMA mapping according to vIOMMU
	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000; 	// GIOVA
		.vaddr	= 0x40001000;	// HVA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.3. IOASID nesting (software)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with software-based IOASID nesting 
available. In this mode it is the kernel instead of the user that creates the
shadow mapping.

The flow before the guest boots is the same as in 5.2, except for one point. 
Because giova_ioasid is nested on gpa_ioasid, locked page accounting is only 
conducted for gpa_ioasid, so it's not necessary to pre-register the virtual 
memory.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

	/* After boot */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
	  * to form a shadow mapping.
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x1000;	// GPA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.4. IOASID nesting (hardware)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with hardware-based IOASID nesting
available. In this mode the pgtable binding protocol is used to 
bind the guest IOVA page table with the IOMMU:

	/* After boot */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= giova_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular information
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

	/* See 5.6 for I/O page fault handling */
	
5.5. Guest SVA (vSVA)
++++++++++++++++++

After boot the guest further creates a GVA address space (gpasid1) on 
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, the user should avoid exposing ENQCMD on both
pdev and mdev.

The sequence applies to all device types (pdev or mdev), except for
one additional step to call KVM for an ENQCMD-capable mdev:

	/* After boot */
	/* Make GVA space nested on GPA space */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space and specify vPASID */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
	  * translation structure through KVM
	  */
	pa_data = {
		.ioasid_fd	= ioasid_fd;
		.ioasid		= gva_ioasid;
		.guest_pasid	= gpasid1;
	};
	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...


5.6. I/O page fault
+++++++++++++++

(The uAPI is TBD. This is just the high-level flow from the host IOMMU driver
to the guest IOMMU driver and back.)

-   Host IOMMU driver receives a page request with raw fault_data {rid, 
    pasid, addr};

-   Host IOMMU driver identifies the faulting I/O page table according to
    information registered by IOASID fault handler;

-   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
    is saved in ioasid_data->fault_data (used for response);

-   IOASID fault handler generates an user fault_data (ioasid, addr), links it 
    to the shared ring buffer and triggers eventfd to userspace;

-   Upon receiving the event, Qemu needs to find the virtual routing information 
    (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
    multiple, pick a random one. This should be fine since the purpose is to
    fix the I/O page table on the guest;

-   Qemu generates a virtual I/O page fault through vIOMMU into guest,
    carrying the virtual fault data (v_rid, v_pasid, addr);

-   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
    then sends a page response with virtual completion data (v_rid, v_pasid, 
    response_code) to vIOMMU;

-   Qemu finds the pending fault event, converts virtual completion data 
    into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
    complete the pending fault;

-   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
    ioasid_data->fault_data, and then calls iommu api to complete it with
    {rid, pasid, response_code};
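
Since the fault uAPI itself is TBD, below is only a hedged sketch of the last
two steps from Qemu's perspective; the structure layout and the ioctl name
(IOASID_COMPLETE_FAULT) are hypothetical placeholders:

	fault_resp = {
		.ioasid		= gva_ioasid;	// identifies the pending fault
		.addr		= fault_addr;	// from the user fault_data
		.response_code	= resp_code;	// converted from guest completion data
	};
	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &fault_resp);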

5.7. BIND_PASID_TABLE
++++++++++++++++++++

The PASID table is put in the GPA space on some platforms, thus must be updated
by the guest. It is treated as another user page table to be bound to the 
IOMMU.

As explained earlier, the user still needs to explicitly bind every user I/O 
page table to the kernel so that the same pgtable binding protocol (bind, cache 
invalidation and fault handling) is unified across platforms.

vIOMMUs may include a caching mode (or a paravirtualized method) which, once 
enabled, requires the guest to invalidate the PASID cache for any change to the 
PASID table. This allows Qemu to track the lifespan of guest I/O page tables.

If such a capability is missing, Qemu could enable write-protection on
the guest PASID table to achieve the same effect.

	/* After boot */
	/* Make vPASID space nested on GPA space */
	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to pasidtbl_ioasid */
	at_data = { .ioasid = pasidtbl_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind PASID table */
	bind_data = {
		.ioasid	= pasidtbl_ioasid;
		.addr	= gpa_pasid_table;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);

	/* vIOMMU detects a new GVA I/O space created */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space, with gpasid1 */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
	  * used, the kernel will not update the PASID table. Instead, it just
	  * tracks the bound I/O page table for handling invalidation and
	  * I/O page faults.
	  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...

Thanks
Kevin


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-28  2:24   ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-05-28  2:24 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com)
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy


On 2021/5/27 3:58 PM, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.


I'm not a native speaker, but /dev/ioas seems better?


>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
>
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
>      2.1. /dev/ioasid uAPI
>      2.2. /dev/vfio uAPI
>      2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
>      5.1. A simple example
>      5.2. Multiple IOASIDs (no nesting)
>      5.3. IOASID nesting (software)
>      5.4. IOASID nesting (hardware)
>      5.5. Guest SVA (vSVA)
>      5.6. I/O page fault
>      5.7. BIND_PASID_TABLE
> ====
>
> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOASID FD is the container holding multiple I/O address spaces. User
> manages those address spaces through FD operations. Multiple FD's are
> allowed per process, but with this proposal one FD should be sufficient for
> all intended usages.
>
> IOASID is the FD-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).
>
> I/O address space can be managed through two protocols, according to
> whether the corresponding I/O page table is constructed by the kernel or
> the user. When kernel-managed, a dma mapping protocol (similar to
> existing VFIO iommu type1) is provided for the user to explicitly specify
> how the I/O address space is mapped. Otherwise, a different protocol is
> provided for the user to bind an user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> handling.
>
> Pgtable binding protocol can be used only on the child IOASID's, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust
> userspace. Nesting allows the kernel to enforce its DMA isolation policy
> through the parent IOASID.
>
> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver


Need to explain what "ioasid driver" means.

I guess it's the module that implements the IOASID abstraction:

1) RID
2) RID+PASID
3) others

And if yes, does it allow a software-specific implementation for the device:

1) swiotlb or
2) device specific IOASID implementation


> is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child/parent page tables operated through
> the dma mapping protocol, so any change in either level can be captured
> by the kernel to update the corresponding shadow mapping.
>
> An I/O address space takes effect in the IOMMU only after it is attached
> to a device. The device in the /dev/ioasid context always refers to a
> physical one or 'pdev' (PF or VF).
>
> One I/O address space could be attached to multiple devices. In this case,
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
>
> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
>
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying
> the routing information and registering it to the ioasid driver when calling
> ioasid attach helper function. It could be RID if the assigned device is
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> user might also provide its view of virtual routing information (vPASID) in
> the attach call, e.g. when multiple user-managed I/O address spaces are
> attached to the vfio_device. In this case VFIO must figure out whether
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
>
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device
> should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have
> different view we can further discuss.
>
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before sent to the
> device. A detail explanation about PASID virtualization policies can be
> found in section 4.
>
> Modern devices may support a scalable workload submission interface
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having
> PASID saved in the CPU MSR and carried in the instruction payload
> when sent out to the device. Then a single work queue shared by
> multiple processes can compose DMAs carrying different PASIDs.
>
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability
> for auto-conversion in the fast path. The user is expected to setup the
> PASID mapping through KVM uAPI, with information about {vpasid,
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> to figure out the actual pPASID given an IOASID.
>
> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). If there is a
> need of further relaying this fault into the guest, the user is responsible
> of identifying the device attached to this IOASID (randomly pick one if
> multiple attached devices) and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
>
> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure.


I'm not sure this is true for all archs.


>   Some platforms implement the PASID table in the guest
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.
>
> We propose the host kernel needs to explicitly track  guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion
> was treating this special IOASID as a container for all guest I/O page
> tables i.e. hiding them from the host. However this way significantly
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID
> one address space any more. Device routing information (indirectly
> marking hidden I/O spaces) has to be carried in iotlb invalidation and
> page faulting uAPI to help connect vIOMMU with the underlying
> pIOMMU. This is one design choice to be confirmed with ARM guys.
>
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device.
>
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no
> device notation in this interface as aforementioned. But the ioasid driver
> does implicit check to make sure that devices within an iommu group
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to
> the user.
>
> There was a long debate in previous discussion whether VFIO should keep
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
> a simplified model where every device bound to VFIO is explicitly listed
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for
> understanding the group topology and meeting the implicit group check
> criteria enforced in /dev/ioasid. The use case examples in this proposal
> are based on the new model.
>
> Of course for backward compatibility VFIO still needs to keep the existing
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO
> iommu ops to internal ioasid helper functions.
>
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>      iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>      find a better name later to differentiate.
>
> -   PPC has not be considered yet as we haven't got time to fully understand
>      its semantics. According to previous discussion there is some generality
>      between PPC window-based scheme and VFIO type1 semantics. Let's
>      first make consensus on this proposal and then further discuss how to
>      extend it to cover PPC's requirement.
>
> -   There is a protocol between vfio group and kvm. Needs to think about
>      how it will be affected following this proposal.
>
> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
>      which can be physically isolated in-between through PASID-granular
>      IOMMU protection. Historically people also discussed one usage by
>      mediating a pdev into a mdev. This usage is not covered here, and is
>      supposed to be replaced by Max's work which allows overriding various
>      VFIO operations in vfio-pci driver.
>
> 2. uAPI Proposal
> ----------------------
>
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
>
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
>
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
>
>
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
>    * Check whether an uAPI extension is supported.
>    *
>    * This is for FD-level capabilities, such as locked page pre-registration.
>    * IOASID-level capabilities are reported through IOASID_GET_INFO.
>    *
>    * Return: 0 if not supported, 1 if supported.
>    */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> /*
>    * Register user space memory where DMA is allowed.
>    *
>    * It pins user pages and does the locked memory accounting so sub-
>    * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>    *
>    * When this ioctl is not used, one user page might be accounted
>    * multiple times when it is mapped by multiple IOASIDs which are
>    * not nested together.
>    *
>    * Input parameters:
>    *	- vaddr;
>    *	- size;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)
>
>
> /*
>    * Allocate an IOASID.
>    *
>    * IOASID is the FD-local software handle representing an I/O address
>    * space. Each IOASID is associated with a single I/O page table. User
>    * must call this ioctl to get an IOASID for every I/O address space that is
>    * intended to be enabled in the IOMMU.
>    *
>    * A newly-created IOASID doesn't accept any command before it is
>    * attached to a device. Once attached, an empty I/O page table is
>    * bound with the IOMMU then the user could use either DMA mapping
>    * or pgtable binding commands to manage this I/O page table.
>    *
>    * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>    *
>    * Return: allocated ioasid on success, -errno on failure.
>    */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)


I would like to know the reason for such indirection.

It looks to me like the ioasid fd is sufficient for performing any operations.

Such an allocation only works if an ioasid fd can have multiple ioasids, which 
seems not to be the case you describe here.


>
>
> /*
>    * Get information about an I/O address space
>    *
>    * Supported capabilities:
>    *	- VFIO type1 map/unmap;
>    *	- pgtable/pasid_table binding
>    *	- hardware nesting vs. software nesting;
>    *	- ...
>    *
>    * Related attributes:
>    * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>    *	- vendor pgtable formats (pgtable binding);
>    *	- number of child IOASIDs (nesting);
>    *	- ...
>    *
>    * Above information is available only after one or more devices are
>    * attached to the specified IOASID. Otherwise the IOASID is just a
>    * number w/o any capability or attribute.
>    *
>    * Input parameters:
>    *	- u32 ioasid;
>    *
>    * Output parameters:
>    *	- many. TBD.
>    */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
>
>
> /*
>    * Map/unmap process virtual addresses to I/O virtual addresses.
>    *
>    * Provide VFIO type1 equivalent semantics. Start with the same
>    * restriction e.g. the unmap size should match those used in the
>    * original mapping call.
>    *
>    * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>    * must be already in the preregistered list.
>    *
>    * Input parameters:
>    *	- u32 ioasid;
>    *	- refer to vfio_iommu_type1_dma_{un}map
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
>
>
> /*
>    * Create a nesting IOASID (child) on an existing IOASID (parent)
>    *
>    * IOASIDs can be nested together, implying that the output address
>    * from one I/O page table (child) must be further translated by
>    * another I/O page table (parent).
>    *
>    * As the child adds essentially another reference to the I/O page table
>    * represented by the parent, any device attached to the child ioasid
>    * must be already attached to the parent.
>    *
>    * In concept there is no limit on the number of the nesting levels.
>    * However for the majority case one nesting level is sufficient. The
>    * user should check whether an IOASID supports nesting through
>    * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>    * the nesting capability is reported only on the parent instead of the
>    * child.
>    *
>    * User also needs check (via IOASID_GET_INFO) whether the nesting
>    * is implemented in hardware or software. If software-based, DMA
>    * mapping protocol should be used on the child IOASID. Otherwise,
>    * the child should be operated with pgtable binding protocol.
>    *
>    * Input parameters:
>    *	- u32 parent_ioasid;
>    *
>    * Return: child_ioasid on success, -errno on failure;
>    */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
>
>
> /*
>    * Bind an user-managed I/O page table with the IOMMU
>    *
>    * Because user page table is untrusted, IOASID nesting must be enabled
>    * for this ioasid so the kernel can enforce its DMA isolation policy
>    * through the parent ioasid.
>    *
>    * Pgtable binding protocol is different from DMA mapping. The latter
>    * has the I/O page table constructed by the kernel and updated
>    * according to user MAP/UNMAP commands. With pgtable binding the
>    * whole page table is created and updated by userspace, thus different
>    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>    *
>    * Because the page table is directly walked by the IOMMU, the user
>    * must  use a format compatible to the underlying hardware. It can
>    * check the format information through IOASID_GET_INFO.
>    *
>    * The page table is bound to the IOMMU according to the routing
>    * information of each attached device under the specified IOASID. The
>    * routing information (RID and optional PASID) is registered when a
>    * device is attached to this IOASID through VFIO uAPI.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- address of the user page table;
>    *	- formats (vendor, address_width, etc.);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
>
>
> /*
>    * Bind an user-managed PASID table to the IOMMU
>    *
>    * This is required for platforms which place PASID table in the GPA space.
>    * In this case the specified IOASID represents the per-RID PASID space.
>    *
>    * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>    * special flag to indicate the difference from normal I/O address spaces.
>    *
>    * The format info of the PASID table is reported in IOASID_GET_INFO.
>    *
>    * As explained in the design section, user-managed I/O page tables must
>    * be explicitly bound to the kernel even on these platforms. It allows
>    * the kernel to uniformly manage I/O address spaces cross all platforms.
>    * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>    * to carry device routing information to indirectly mark the hidden I/O
>    * address spaces.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- address of PASID table;
>    *	- formats (vendor, size, etc.);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
>
>
> /*
>    * Invalidate IOTLB for an user-managed I/O page table
>    *
>    * Unlike what's defined in include/uapi/linux/iommu.h, this command
>    * doesn't allow the user to specify cache type and likely support only
>    * two granularities (all, or a specified range) in the I/O address space.
>    *
>    * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
>    * cache). If the IOASID represents an I/O address space, the invalidation
>    * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>    * represents a vPASID space, then this command applies to the PASID
>    * cache.
>    *
>    * Similarly this command doesn't provide IOMMU-like granularity
>    * info (domain-wide, pasid-wide, range-based), since it's all about the
>    * I/O address space itself. The ioasid driver walks the attached
>    * routing information to match the IOMMU semantics under the
>    * hood.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- granularity
>    *
>    * Return: 0 on success, -errno on failure
>    */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
>
>
> /*
>    * Page fault report and response
>    *
>    * This is TBD. Can be added after other parts are cleared up. Likely it
>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>    * the user and an ioctl to complete the fault.
>    *
>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>    */
>
>
> /*
>    * Dirty page tracking
>    *
>    * Track and report memory pages dirtied in I/O address spaces. There
>    * is an ongoing work by Kunkun Jiang by extending existing VFIO type1.
>    * It needs be adapted to /dev/ioasid later.
>    */
>
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
>
> /*
>    * Bind a vfio_device to the specified IOASID fd
>    *
>    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
>    * vfio device should not be bound to multiple ioasid_fd's.
>    *
>    * Input parameters:
>    *	- ioasid_fd;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
>
>
> /*
>    * Attach a vfio device to the specified IOASID
>    *
>    * Multiple vfio devices can be attached to the same IOASID, and vice
>    * versa.
>    *
>    * User may optionally provide a "virtual PASID" to mark an I/O page
>    * table on this vfio device. Whether the virtual PASID is physically used
>    * or converted to another kernel-allocated PASID is a policy in vfio device
>    * driver.
>    *
>    * There is no need to specify ioasid_fd in this call due to the assumption
>    * of 1:1 connection between vfio device and the bound fd.
>    *
>    * Input parameter:
>    *	- ioasid;
>    *	- flag;
>    *	- user_pasid (if specified);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
>
>
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
>    * Update CPU PASID mapping
>    *
>    * This is necessary when ENQCMD will be used in the guest while the
>    * targeted device doesn't accept the vPASID saved in the CPU MSR.
>    *
>    * This command allows user to set/clear the vPASID->pPASID mapping
>    * in the CPU, by providing the IOASID (and FD) information representing
>    * the I/O address space marked by this vPASID.
>    *
>    * Input parameters:
>    *	- user_pasid;
>    *	- ioasid_fd;
>    *	- ioasid;
>    */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
>
>
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;
> 		// a list of registered devices
> 		struct list_head		dev_list;
> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;
> 	};
>
> Each registered device is represented by ioasid_dev:
>
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device
> 		struct device 		*device;
> 		struct kref		kref;
> 	};
>
> Because we assume one vfio_device connected to at most one ioasid_fd,
> here ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. PASID marking this
> mdev is specified later when VFIO_ATTACH_IOASID.
>
> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
>
> 		// the IOASID number
> 		u32			ioasid;
>
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;
>
> 		// map metadata (vfio type1 semantics)
> 		struct rb_node		dma_list;
>
> 		// pointer to user-managed pgtable (for nesting case)
> 		u64			user_pgd;
>
> 		// link to the parent ioasid (for nesting)
> 		struct ioasid_data	*parent;
>
> 		// cache the global PASID shared by ENQCMD-capable
> 		// devices (see below explanation in section 4)
> 		u32			pasid;
>
> 		// a list of device attach data (routing information)
> 		struct list_head		attach_data;
>
> 		// a list of partially-attached devices (group)
> 		struct list_head		partial_devices;
>
> 		// a list of fault_data reported from the iommu layer
> 		struct list_head		fault_data;
>
> 		...
> 	}
>
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be corelated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this as open
> for now as this proposal is mainly about uAPI. For simplification
> purpose the two objects are kept separate in this context, assuming an
> 1:1 connection in-between and the domain as the place-holder
> representing the 1st class object in the iommu ops.
>
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;
> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev,
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> The pasid parameter is optionally provided based on the policy in vfio
> device driver. It could be the PASID marking the default I/O address
> space for a mdev, or the user-provided PASID marking an user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check next section for detail explanation.
>
> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}
>
> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.
>
> Then is the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> 		u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share a same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to setup PASID translation
> structure when user calls KVM_MAP_PASID.
>
> 4. PASID Virtualization
> ------------------------------
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark a GVA address space while pPASID is the one
> selected by the host and actually routed on the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> The vfio device driver translates vPASID to pPASID before calling
> ioasid_device_attach, with two factors to be considered:
>
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or
>       should be instead converted to a newly-allocated one (vPASID!=
>       pPASID);
>
> -    If vPASID!=pPASID, whether pPASID is allocated from a per-RID PASID
>       space or a global PASID space (implying sharing pPASID across devices,
>       e.g. when supporting Intel ENQCMD which puts the PASID in a CPU MSR
>       as part of the process context);
>
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)
>
> 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
>
>       vPASIDs are directly programmed by the guest into the assigned MMIO
>       bar, implying all DMAs out of this device have the vPASID in the packet
>       header. This mandates vPASID==pPASID, sort of delegating the entire
>       per-RID PASID space to the guest.
>
>       When ENQCMD is enabled, the CPU MSR when running a guest task
>       contains a vPASID. In this case the CPU PASID translation capability
>       should be disabled so this vPASID in CPU MSR is directly sent to the
>       wire.
>
>       This ensures consistent vPASID usage on pdev regardless of whether the
>       workload is submitted through an MMIO register or the ENQCMD
>       instruction.
>
> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
>       PASIDs are also used by the kernel to mark the default I/O address
>       space for an mdev, thus they cannot be delegated to the guest. Instead,
>       the mdev driver must allocate a new pPASID for each vPASID (thus
>       vPASID!=pPASID) and then use the pPASID when attaching this mdev to
>       an ioasid.
>
>       The mdev driver needs to cache the PASID mapping so that, in the
>       mediation path, a vPASID programmed by the guest can be converted to
>       its pPASID before updating the physical MMIO register. The mapping
>       should also be saved in the CPU PASID translation structure (via KVM
>       uAPI), so the vPASID saved in the CPU MSR is auto-translated to the
>       pPASID before being sent to the wire, when ENQCMD is enabled.
>
>       Generally pPASID could be allocated from the per-RID PASID space
>       if all mdev's created on the parent device don't support ENQCMD.
>
>       However if the parent supports ENQCMD-capable mdevs, pPASIDs
>       must be allocated from a global pool because the CPU PASID
>       translation structure is per-VM. It implies that when a guest I/O
>       page table is attached to two mdevs with a single vPASID (i.e. bound
>       to the same guest process), the same pPASID should be used for
>       both mdevs even when they belong to different parents. Sharing a
>       pPASID across mdevs is achieved by calling the aforementioned
>       ioasid_get_global_pasid().
>
> 3)  Mix pdev/mdev together
>
>       The above policies are per device type and thus are not affected when
>       mixing those device types together (when assigned to a single guest).
>       However, there is one exception - when both pdev and mdev support
>       ENQCMD.
>
>       Remember the two types have conflicting requirements on whether
>       CPU PASID translation should be enabled. This capability is per-VM,
>       and must be enabled for mdev isolation. When enabled, a pdev will
>       receive mdev pPASIDs, violating its vPASID==pPASID expectation.
>
>       In a previous thread a PASID range split scheme was discussed to
>       support this combination, but we haven't worked out a clean uAPI
>       design yet. Therefore in this proposal we decided not to support it,
>       implying the user should have some intelligence to avoid such a
>       scenario. It could be a TODO task for the future.
>
> Despite those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;
>
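> That simple policy could be captured in one helper inside the vfio device
> driver, e.g. (illustrative only; is_pdev() and the vdev fields are
> hypothetical):
>
> 	u32 vfio_translate_pasid(struct vfio_device *vdev, u32 vpasid)
> 	{
> 		if (is_pdev(vdev))
> 			return vpasid;		/* v == p */
> 		/* mdev: a global pPASID shared through the ioasid layer */
> 		return ioasid_get_global_pasid(vdev->ioasid_ctx,
> 					       vdev->ioasid, true);
> 	}
>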
> Regardless of the kernel policy, the user policy is unchanged:
>
> -    provide vPASID when calling VFIO_ATTACH_IOASID;
> -    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> -    Don't expose ENQCMD capability on both pdev and mdev;
>
> Sample user flow is described in section 5.5.
>
> 5. Use Cases and Flows
> -------------------------------
>
> Here we assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio, thus a device fd can be acquired w/o
> going through the legacy container/group interface. For illustration
> purposes those devices are just called dev[1...N]:
>
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> 	ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity the examples below are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
>
> Three types of IOASIDs are considered:
>
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, unless explicitly marked out
> (e.g. in section 5.5). The VFIO device driver in the kernel will figure out
> the associated routing information in the attach operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned more devices than dev1, the user follows the above
> sequence to attach the other devices to the same gpa_ioasid, i.e. sharing
> the GPA address space across all assigned devices.
>
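> For completeness, a possible teardown sequence mirroring the setup above
> (illustrative; argument conventions follow the earlier examples):
>
> 	/* Tear down when the device is no longer assigned */
> 	dma_unmap = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_UNMAP_DMA, &dma_unmap);
>
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_DETACH_IOASID, &at_data);
> 	ioctl(ioasid_fd, IOASID_FREE, gpa_ioasid);
> 	ioctl(device_fd, VFIO_UNBIND_IOASID_FD, ioasid_fd);
>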
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
>
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 	/* After boot, guest enables a GIOVA space for dev2 */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
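> If the guest later changes or removes this vIOMMU mapping, Qemu updates
> the shadow accordingly, e.g. (illustrative):
>
> 	/* Guest unmaps GIOVA 0x2000 */
> 	dma_unmap = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_UNMAP_DMA, &dma_unmap);
>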
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of the user that creates
> the shadow mapping.
>
> The flow before the guest boots is the same as in 5.2, except for one
> point. Because giova_ioasid is nested on gpa_ioasid, locked page
> accounting is only conducted for gpa_ioasid, so it's not necessary to
> pre-register the virtual memory.
>
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before the guest boots):
>
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


For vDPA, we need something similar. And in the future, vDPA may allow 
multiple IOASIDs to be attached to a single device. It should work with 
the current design.


>
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
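> Since both levels are managed through the DMA mapping protocol here, a
> change on either level lets the kernel refresh the shadow. For example
> (illustrative; new_hva is hypothetical), remapping the backing GPA on the
> parent:
>
> 	/* remap GPA 0x1000 to a different HVA on the parent */
> 	dma_unmap = { .ioasid = gpa_ioasid; .iova = 0x1000; .size = 4KB; };
> 	ioctl(ioasid_fd, IOASID_UNMAP_DMA, &dma_unmap);
>
> 	dma_map = { .ioasid = gpa_ioasid; .iova = 0x1000;
> 		    .vaddr = new_hva; .size = 4KB; };
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 	/* kernel re-merges GIOVA 0x2000 -> GPA 0x1000 -> new HVA in the shadow */
>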
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't support 
hardware nesting. Or is there a way to detect the capability beforehand?

I think GET_INFO only works after the ATTACH.


>
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	/* Invalidate IOTLB when required */
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		// granular information
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>
> 	/* See 5.6 for I/O page fault handling */
> 	
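> Note: before using this flow the user is expected to confirm hardware
> nesting support and the supported page table formats via IOASID_GET_INFO
> on the parent, after at least one device has been attached to it. The exact
> output fields are still TBD; the flag below is hypothetical:
>
> 	info = { .ioasid = gpa_ioasid };
> 	ioctl(ioasid_fd, IOASID_GET_INFO, &info);
> 	if (!(info.flags & IOASID_INFO_CAP_HW_NESTING))	// hypothetical flag
> 		/* fall back to the software nesting flow in 5.3 */
>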
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boot the guest further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
>
> The sequence applies to all device types (whether pdev or mdev), except
> for one additional step to call KVM for an ENQCMD-capable mdev:


My understanding is ENQCMD is Intel specific and not a requirement for 
having vSVA.


>
> 	/* After boots */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	...
>
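> When the guest later tears down this GVA space (e.g. the process exits),
> the user unwinds the setup in reverse order (illustrative):
>
> 	ioctl(ioasid_fd, IOASID_UNBIND_PGTABLE, &bind_data);
>
> 	/* for an ENQCMD-capable mdev, also remove the CPU PASID mapping */
> 	ioctl(kvm_fd, KVM_UNMAP_PASID, &pa_data);
>
> 	at_data = { .ioasid = gva_ioasid};
> 	ioctl(device_fd1, VFIO_DETACH_IOASID, &at_data);
> 	ioctl(ioasid_fd, IOASID_FREE, gva_ioasid);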
>
> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
>
> -   Host IOMMU driver receives a page request with raw fault_data {rid,
>      pasid, addr};
>
> -   Host IOMMU driver identifies the faulting I/O page table according to
>      information registered by IOASID fault handler;
>
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
>      is saved in ioasid_data->fault_data (used for response);
>
> -   The IOASID fault handler generates a user fault_data (ioasid, addr),
>      links it to the shared ring buffer and triggers the eventfd to userspace;
>
> -   Upon receiving the event, Qemu needs to find the virtual routing
>      information (v_rid + v_pasid) of the device attached to the faulting
>      ioasid. If there are multiple, pick a random one. This should be fine
>      since the purpose is to fix the I/O page table in the guest;
>
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>      carrying the virtual fault data (v_rid, v_pasid, addr);
>
> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>      then sends a page response with virtual completion data (v_rid, v_pasid,
>      response_code) to vIOMMU;
>
> -   Qemu finds the pending fault event, converts virtual completion data
>      into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
>      complete the pending fault;
>
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
>      ioasid_data->fault_data, and then calls iommu api to complete it with
>      {rid, pasid, response_code};
>
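> The fault uAPI itself is TBD; purely as an illustration of the above flow,
> the user-visible record and the completion call might look like this (all
> structure and ioctl names below are hypothetical):
>
> 	/* consumed from the shared ring buffer, signalled via eventfd */
> 	struct ioasid_fault_data {
> 		u32	ioasid;
> 		u64	addr;
> 	};
>
> 	/* after the guest responds through the vIOMMU */
> 	resp = {
> 		.ioasid		= fault.ioasid;
> 		.addr		= fault.addr;
> 		.response_code	= code;
> 	};
> 	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &resp);	// hypothetical ioctl
>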
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> The PASID table is put in the GPA space on some platforms, thus it must be
> updated by the guest. It is treated as another user page table to be bound
> with the IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
>
> vIOMMUs may include a caching mode (or a paravirtualized mechanism)
> which, once enabled, requires the guest to invalidate the PASID cache for
> any change to the PASID table. This allows Qemu to track the lifespan of
> guest I/O page tables.
>
> If such a capability is missing, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> 	/* After boots */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);


Do we need VFIO_DETACH_IOASID?

Thanks


>
> 	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	...
>
> Thanks
> Kevin
>



* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-05-28  2:24   ` Jason Wang
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-05-28  2:24 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com)
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, Jonathan Corbet,
	Kirti Wankhede, David Gibson, Robin Murphy, Wu, Hao


在 2021/5/27 下午3:58, Tian, Kevin 写道:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.


Not a native speaker but /dev/ioas seems better?


>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
>
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
>      2.1. /dev/ioasid uAPI
>      2.2. /dev/vfio uAPI
>      2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
>      5.1. A simple example
>      5.2. Multiple IOASIDs (no nesting)
>      5.3. IOASID nesting (software)
>      5.4. IOASID nesting (hardware)
>      5.5. Guest SVA (vSVA)
>      5.6. I/O page fault
>      5.7. BIND_PASID_TABLE
> ====
>
> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOASID FD is the container holding multiple I/O address spaces. User
> manages those address spaces through FD operations. Multiple FD's are
> allowed per process, but with this proposal one FD should be sufficient for
> all intended usages.
>
> IOASID is the FD-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).
>
> I/O address space can be managed through two protocols, according to
> whether the corresponding I/O page table is constructed by the kernel or
> the user. When kernel-managed, a dma mapping protocol (similar to
> existing VFIO iommu type1) is provided for the user to explicitly specify
> how the I/O address space is mapped. Otherwise, a different protocol is
> provided for the user to bind an user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> handling.
>
> Pgtable binding protocol can be used only on the child IOASID's, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust
> userspace. Nesting allows the kernel to enforce its DMA isolation policy
> through the parent IOASID.
>
> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver


Need to explain what "ioasid driver" means.

I guess it's the module that implements the IOASID abstraction:

1) RID
2) RID+PASID
3) others

And if yes, does it allow the device to have a software-specific implementation:

1) swiotlb or
2) a device-specific IOASID implementation


> is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child and parent page tables to be operated
> through the dma mapping protocol, so any change in either level can be
> captured by the kernel to update the corresponding shadow mapping.
>
> An I/O address space takes effect in the IOMMU only after it is attached
> to a device. The device in the /dev/ioasid context always refers to a
> physical one or 'pdev' (PF or VF).
>
> One I/O address space could be attached to multiple devices. In this case,
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
>
> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
>
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying
> the routing information and registering it to the ioasid driver when calling
> ioasid attach helper function. It could be RID if the assigned device is
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> user might also provide its view of virtual routing information (vPASID) in
> the attach call, e.g. when multiple user-managed I/O address spaces are
> attached to the vfio_device. In this case VFIO must figure out whether
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
>
> A device must be bound to an IOASID FD before the attach operation can
> be conducted. This is also done through VFIO uAPI. In this proposal one
> device should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have a
> different view we can further discuss.
>
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before being sent to the
> device. A detailed explanation of PASID virtualization policies can be
> found in section 4.
>
> Modern devices may support a scalable workload submission interface
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having
> PASID saved in the CPU MSR and carried in the instruction payload
> when sent out to the device. Then a single work queue shared by
> multiple processes can compose DMAs carrying different PASIDs.
>
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability
> for auto-conversion in the fast path. The user is expected to setup the
> PASID mapping through KVM uAPI, with information about {vpasid,
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> to figure out the actual pPASID given an IOASID.
>
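> For illustration, the KVM side could then look roughly like this (the helper
> is the ioasid_get_global_pasid() introduced in section 3;
> kvm_set_pasid_translation() is a hypothetical name):
>
> 	/* KVM_MAP_PASID handler, sketch */
> 	ctx = ioasid_ctx_fdget(args->ioasid_fd);
> 	ppasid = ioasid_get_global_pasid(ctx, args->ioasid, false);
> 	kvm_set_pasid_translation(kvm, args->user_pasid, ppasid);
>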
> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). If there is a
> need of further relaying this fault into the guest, the user is responsible
> of identifying the device attached to this IOASID (randomly pick one if
> multiple attached devices) and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
>
> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure.


I'm not sure this is true for all archs.


>   Some platforms implement the PASID table in the guest
> physical address (GPA) space, expecting it to be managed by the guest. The
> guest PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.
>
> We propose that the host kernel explicitly track guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with the only difference being
> who actually writes the PASID table). One opinion from the previous
> discussion was to treat this special IOASID as a container for all guest
> I/O page tables, i.e. hiding them from the host. However this significantly
> violates the philosophy of this /dev/ioasid proposal. It is no longer one
> IOASID, one address space. Device routing information (indirectly
> marking hidden I/O spaces) would have to be carried in the iotlb
> invalidation and page faulting uAPI to help connect the vIOMMU with the
> underlying pIOMMU. This is one design choice to be confirmed with ARM
> guys.
>
> Devices may sit behind IOMMUs with incompatible capabilities. The
> difference may lie in the I/O page table format, or in the availability of a
> user-visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
> checking the incompatibility between a newly-attached device and the
> existing devices under the specific IOASID and, if found, returning an
> error to the user. Upon such an error the user should create a new IOASID
> for the incompatible device.
>
> There is no explicit group enforcement in the /dev/ioasid uAPI, due to
> there being no device notation in this interface as aforementioned. But the
> ioasid driver does an implicit check to make sure that devices within an
> iommu group are all attached to the same IOASID before this IOASID starts
> to accept any uAPI command. Otherwise an error is returned to the user.
>
> There was a long debate in previous discussion whether VFIO should keep
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
> a simplified model where every device bound to VFIO is explicitly listed
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for
> understanding the group topology and meeting the implicit group check
> criteria enforced in /dev/ioasid. The use case examples in this proposal
> are based on the new model.
>
> Of course for backward compatibility VFIO still needs to keep the existing
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO
> iommu ops to internal ioasid helper functions.
>
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>      iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>      to find a better name later to differentiate.
>
> -   PPC has not been considered yet as we haven't had time to fully
>      understand its semantics. According to previous discussion there is some
>      generality between the PPC window-based scheme and VFIO type1
>      semantics. Let's first reach consensus on this proposal and then further
>      discuss how to extend it to cover PPC's requirements.
>
> -   There is a protocol between vfio group and kvm. Needs to think about
>      how it will be affected following this proposal.
>
> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
>      which can be physically isolated in-between through PASID-granular
>      IOMMU protection. Historically people also discussed one usage by
>      mediating a pdev into a mdev. This usage is not covered here, and is
>      supposed to be replaced by Max's work which allows overriding various
>      VFIO operations in vfio-pci driver.
>
> 2. uAPI Proposal
> ----------------------
>
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
>
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
>
> /dev/kvm uAPI is optional, required only as far as ENQCMD is concerned.
>
>
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
>    * Check whether an uAPI extension is supported.
>    *
>    * This is for FD-level capabilities, such as locked page pre-registration.
>    * IOASID-level capabilities are reported through IOASID_GET_INFO.
>    *
>    * Return: 0 if not supported, 1 if supported.
>    */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> /*
>    * Register user space memory where DMA is allowed.
>    *
>    * It pins user pages and does the locked memory accounting so
>    * subsequent IOASID_MAP/UNMAP_DMA calls get faster.
>    *
>    * When this ioctl is not used, one user page might be accounted
>    * multiple times when it is mapped by multiple IOASIDs which are
>    * not nested together.
>    *
>    * Input parameters:
>    *	- vaddr;
>    *	- size;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)
>
>
> /*
>    * Allocate an IOASID.
>    *
>    * IOASID is the FD-local software handle representing an I/O address
>    * space. Each IOASID is associated with a single I/O page table. User
>    * must call this ioctl to get an IOASID for every I/O address space that is
>    * intended to be enabled in the IOMMU.
>    *
>    * A newly-created IOASID doesn't accept any command before it is
>    * attached to a device. Once attached, an empty I/O page table is
>    * bound with the IOMMU then the user could use either DMA mapping
>    * or pgtable binding commands to manage this I/O page table.
>    *
>    * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>    *
>    * Return: allocated ioasid on success, -errno on failure.
>    */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)


I would like to know the reason for such an indirection.

It looks to me like the ioasid fd is sufficient for performing any operations.

Such an allocation only works if an ioasid fd can have multiple IOASIDs, 
which seems not to be the case you describe here.


>
>
> /*
>    * Get information about an I/O address space
>    *
>    * Supported capabilities:
>    *	- VFIO type1 map/unmap;
>    *	- pgtable/pasid_table binding
>    *	- hardware nesting vs. software nesting;
>    *	- ...
>    *
>    * Related attributes:
>    * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>    *	- vendor pgtable formats (pgtable binding);
>    *	- number of child IOASIDs (nesting);
>    *	- ...
>    *
>    * Above information is available only after one or more devices are
>    * attached to the specified IOASID. Otherwise the IOASID is just a
>    * number w/o any capability or attribute.
>    *
>    * Input parameters:
>    *	- u32 ioasid;
>    *
>    * Output parameters:
>    *	- many. TBD.
>    */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
>
>
> /*
>    * Map/unmap process virtual addresses to I/O virtual addresses.
>    *
>    * Provide VFIO type1 equivalent semantics. Start with the same
>    * restriction e.g. the unmap size should match those used in the
>    * original mapping call.
>    *
>    * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>    * must be already in the preregistered list.
>    *
>    * Input parameters:
>    *	- u32 ioasid;
>    *	- refer to vfio_iommu_type1_dma_{un}map
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
>
>
> /*
>    * Create a nesting IOASID (child) on an existing IOASID (parent)
>    *
>    * IOASIDs can be nested together, implying that the output address
>    * from one I/O page table (child) must be further translated by
>    * another I/O page table (parent).
>    *
>    * As the child adds essentially another reference to the I/O page table
>    * represented by the parent, any device attached to the child ioasid
>    * must be already attached to the parent.
>    *
>    * In concept there is no limit on the number of the nesting levels.
>    * However for the majority case one nesting level is sufficient. The
>    * user should check whether an IOASID supports nesting through
>    * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>    * the nesting capability is reported only on the parent instead of the
>    * child.
>    *
>    * User also needs check (via IOASID_GET_INFO) whether the nesting
>    * is implemented in hardware or software. If software-based, DMA
>    * mapping protocol should be used on the child IOASID. Otherwise,
>    * the child should be operated with pgtable binding protocol.
>    *
>    * Input parameters:
>    *	- u32 parent_ioasid;
>    *
>    * Return: child_ioasid on success, -errno on failure;
>    */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
>
>
> /*
>    * Bind an user-managed I/O page table with the IOMMU
>    *
>    * Because user page table is untrusted, IOASID nesting must be enabled
>    * for this ioasid so the kernel can enforce its DMA isolation policy
>    * through the parent ioasid.
>    *
>    * Pgtable binding protocol is different from DMA mapping. The latter
>    * has the I/O page table constructed by the kernel and updated
>    * according to user MAP/UNMAP commands. With pgtable binding the
>    * whole page table is created and updated by userspace, thus different
>    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>    *
>    * Because the page table is directly walked by the IOMMU, the user
>    * must  use a format compatible to the underlying hardware. It can
>    * check the format information through IOASID_GET_INFO.
>    *
>    * The page table is bound to the IOMMU according to the routing
>    * information of each attached device under the specified IOASID. The
>    * routing information (RID and optional PASID) is registered when a
>    * device is attached to this IOASID through VFIO uAPI.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- address of the user page table;
>    *	- formats (vendor, address_width, etc.);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
>
>
> /*
>    * Bind an user-managed PASID table to the IOMMU
>    *
>    * This is required for platforms which place PASID table in the GPA space.
>    * In this case the specified IOASID represents the per-RID PASID space.
>    *
>    * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>    * special flag to indicate the difference from normal I/O address spaces.
>    *
>    * The format info of the PASID table is reported in IOASID_GET_INFO.
>    *
>    * As explained in the design section, user-managed I/O page tables must
>    * be explicitly bound to the kernel even on these platforms. It allows
>    * the kernel to uniformly manage I/O address spaces cross all platforms.
>    * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>    * to carry device routing information to indirectly mark the hidden I/O
>    * address spaces.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- address of PASID table;
>    *	- formats (vendor, size, etc.);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
>
>
> /*
>    * Invalidate IOTLB for an user-managed I/O page table
>    *
>    * Unlike what's defined in include/uapi/linux/iommu.h, this command
>    * doesn't allow the user to specify cache type and likely support only
>    * two granularities (all, or a specified range) in the I/O address space.
>    *
>    * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
>    * cache). If the IOASID represents an I/O address space, the invalidation
>    * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>    * represents a vPASID space, then this command applies to the PASID
>    * cache.
>    *
>    * Similarly this command doesn't provide IOMMU-like granularity
>    * info (domain-wide, pasid-wide, range-based), since it's all about the
>    * I/O address space itself. The ioasid driver walks the attached
>    * routing information to match the IOMMU semantics under the
>    * hood.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- granularity
>    *
>    * Return: 0 on success, -errno on failure
>    */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
>
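> As an example (the exact granularity encoding is TBD; the flags and fields
> below are hypothetical):
>
> 	/* invalidate a single range in the child I/O address space */
> 	inv_data = {
> 		.ioasid	= child_ioasid;
> 		.flags	= IOASID_INV_RANGE;	// or IOASID_INV_ALL
> 		.iova	= 0x2000;
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>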
>
> /*
>    * Page fault report and response
>    *
>    * This is TBD. Can be added after other parts are cleared up. Likely it
>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>    * the user and an ioctl to complete the fault.
>    *
>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>    */
>
>
> /*
>    * Dirty page tracking
>    *
>    * Track and report memory pages dirtied in I/O address spaces. There
>    * is an ongoing work by Kunkun Jiang by extending existing VFIO type1.
>    * It needs be adapted to /dev/ioasid later.
>    */
>
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
>
> /*
>    * Bind a vfio_device to the specified IOASID fd
>    *
>    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
>    * vfio device should not be bound to multiple ioasid_fd's.
>    *
>    * Input parameters:
>    *	- ioasid_fd;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
>
>
> /*
>    * Attach a vfio device to the specified IOASID
>    *
>    * Multiple vfio devices can be attached to the same IOASID, and vice
>    * versa.
>    *
>    * User may optionally provide a "virtual PASID" to mark an I/O page
>    * table on this vfio device. Whether the virtual PASID is physically used
>    * or converted to another kernel-allocated PASID is a policy in vfio device
>    * driver.
>    *
>    * There is no need to specify ioasid_fd in this call due to the assumption
>    * of 1:1 connection between vfio device and the bound fd.
>    *
>    * Input parameter:
>    *	- ioasid;
>    *	- flag;
>    *	- user_pasid (if specified);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
>
>
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
>    * Update CPU PASID mapping
>    *
>    * This is necessary when ENQCMD will be used in the guest while the
>    * targeted device doesn't accept the vPASID saved in the CPU MSR.
>    *
>    * This command allows user to set/clear the vPASID->pPASID mapping
>    * in the CPU, by providing the IOASID (and FD) information representing
>    * the I/O address space marked by this vPASID.
>    *
>    * Input parameters:
>    *	- user_pasid;
>    *	- ioasid_fd;
>    *	- ioasid;
>    */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
>
>
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;
> 		// a list of registered devices
> 		struct list_head		dev_list;
> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;
> 	};
>
> Each registered device is represented by ioasid_dev:
>
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device
> 		struct device 		*device;
> 		struct kref		kref;
> 	};
>
> Because we assume one vfio_device is connected to at most one ioasid_fd,
> ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. The PASID marking this
> mdev is specified later via VFIO_ATTACH_IOASID.
>
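> For instance, the VFIO_BIND_IOASID_FD path could look roughly like this
> (sketch; the vdev->idev embedding and physical_device_of() are
> hypothetical):
>
> 	struct ioasid_ctx *ctx = ioasid_ctx_fdget(ioasid_fd);
>
> 	vdev->idev.ctx = ctx;
> 	/* the pdev itself, or the parent device for an mdev */
> 	vdev->idev.device = physical_device_of(vdev);
> 	ret = ioasid_register_device(ctx, &vdev->idev);
>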
> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
>
> 		// the IOASID number
> 		u32			ioasid;
>
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;
>
> 		// map metadata (vfio type1 semantics)
> 		struct rb_node		dma_list;
>
> 		// pointer to user-managed pgtable (for nesting case)
> 		u64			user_pgd;
>
> 		// link to the parent ioasid (for nesting)
> 		struct ioasid_data	*parent;
>
> 		// cache the global PASID shared by ENQCMD-capable
> 		// devices (see below explanation in section 4)
> 		u32			pasid;
>
> 		// a list of device attach data (routing information)
> 		struct list_head		attach_data;
>
> 		// a list of partially-attached devices (group)
> 		struct list_head		partial_devices;
>
> 		// a list of fault_data reported from the iommu layer
> 		struct list_head		fault_data;
>
> 		...
> 	}
>
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be correlated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this open
> for now as this proposal is mainly about uAPI. For simplification
> purposes the two objects are kept separate in this context, assuming a
> 1:1 connection in-between, with the domain as the placeholder
> representing the first-class object in the iommu ops.
>
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;
> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev,
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> The pasid parameter is optionally provided based on the policy in vfio
> device driver. It could be the PASID marking the default I/O address
> space for a mdev, or the user-provided PASID marking an user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check next section for detail explanation.
>
> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}
>
> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.
>
> Finally, the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> 		u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share the same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in the next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). The vfio device driver calls this function
> (alloc=true) to get the global PASID for an ioasid before calling
> ioasid_device_attach. KVM also calls this function (alloc=false) to set up
> the PASID translation structure when the user calls KVM_MAP_PASID.
>
> 4. PASID Virtualization
> ------------------------------
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark an GVA address space while pPASID is the one
> selected by the host and actually routed in the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> The vfio device driver translates vPASID to pPASID before calling
> ioasid_device_attach, with two factors to be considered:
>
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or
>       should be instead converted to a newly-allocated one (vPASID!=
>       pPASID);
>
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>       space or a global PASID space (implying sharing pPASID cross devices,
>       e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>       as part of the process context);
>
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)
>
> 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
>
>       vPASIDs are directly programmed by the guest to the assigned MMIO
>       bar, implying all DMAs out of this device having vPASID in the packet
>       header. This mandates vPASID==pPASID, sort of delegating the entire
>       per-RID PASID space to the guest.
>
>       When ENQCMD is enabled, the CPU MSR when running a guest task
>       contains a vPASID. In this case the CPU PASID translation capability
>       should be disabled so this vPASID in CPU MSR is directly sent to the
>       wire.
>
>       This ensures consistent vPASID usage on pdev regardless of the
>       workload submitted through a MMIO register or ENQCMD instruction.
>
> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
>       PASIDs are also used by the kernel to mark the default I/O address
>       space for an mdev, thus they cannot be delegated to the guest. Instead,
>       the mdev driver must allocate a new pPASID for each vPASID (thus
>       vPASID!=pPASID) and then use the pPASID when attaching this mdev to
>       an ioasid.
>
>       The mdev driver needs to cache the PASID mapping so that, in the
>       mediation path, a vPASID programmed by the guest can be converted to
>       its pPASID before updating the physical MMIO register. The mapping
>       should also be saved in the CPU PASID translation structure (via KVM
>       uAPI), so the vPASID saved in the CPU MSR is auto-translated to the
>       pPASID before being sent to the wire, when ENQCMD is enabled.
>
>       Generally pPASID could be allocated from the per-RID PASID space
>       if all mdev's created on the parent device don't support ENQCMD.
>
>       However if the parent supports ENQCMD-capable mdev, pPASIDs
>       must be allocated from a global pool because the CPU PASID
>       translation structure is per-VM. It implies that when an guest I/O
>       page table is attached to two mdevs with a single vPASID (i.e. bind
>       to the same guest process), a same pPASID should be used for
>       both mdevs even when they belong to different parents. Sharing
>       pPASID cross mdevs is achieved by calling aforementioned ioasid_
>       get_global_pasid().
>
> 3)  Mix pdev/mdev together
>
>       Above policies are per device type thus are not affected when mixing
>       those device types together (when assigned to a single guest). However,
>       there is one exception - when both pdev/mdev support ENQCMD.
>
>       Remember the two types have conflicting requirements on whether
>       CPU PASID translation should be enabled. This capability is per-VM,
>       and must be enabled for mdev isolation. When enabled, pdev will
>       receive a mdev pPASID violating its vPASID expectation.
>
>       In previous thread a PASID range split scheme was discussed to support
>       this combination, but we haven't worked out a clean uAPI design yet.
>       Therefore in this proposal we decide to not support it, implying the
>       user should have some intelligence to avoid such scenario. It could be
>       a TODO task for future.
>
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;
>
> Regardless of the kernel policy, the user policy is unchanged:
>
> -    provide vPASID when calling VFIO_ATTACH_IOASID;
> -    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> -    Don't expose ENQCMD capability on both pdev and mdev;
>
> Sample user flow is described in section 5.5.
>
> 5. Use Cases and Flows
> -------------------------------
>
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
>
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> 	ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
>
> Three types of IOASIDs are considered:
>
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> associated routing information in the attaching operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned more devices than dev1, the user follows the above
> sequence to attach the other devices to the same gpa_ioasid, i.e. sharing
> the GPA address space across all assigned devices.
>
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
>
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 	/* After boot, guest enables a GIOVA space for dev2 */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of the user that creates
> the shadow mapping.
>
> The flow before the guest boots is the same as in 5.2, except for one
> point. Because giova_ioasid is nested on gpa_ioasid, locked page
> accounting is only conducted for gpa_ioasid, so it's not necessary to
> pre-register the virtual memory.
>
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before the guest boots):
>
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


For vDPA, we need something similar. And in the future, vDPA may allow 
multiple IOASIDs to be attached to a single device. It should work with 
the current design.


>
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't support 
hardware nesting. Or is there a way to detect the capability beforehand?

I think GET_INFO only works after the ATTACH.
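
One way this could be structured with the proposed uAPI, assuming
IOASID_GET_INFO reports a nesting capability once a device is attached
(the flag name below is made up for illustration):

	/* dev2 is already attached to gpa_ioasid, so GET_INFO is usable */
	info = { .ioasid = gpa_ioasid };
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	if (info.flags & IOASID_INFO_HW_NESTING) {	/* hypothetical flag */
		/* follow 5.4: create a child IOASID and bind the guest
		  * I/O page table to it
		  */
	} else {
		/* fall back to 5.3 (software nesting) or 5.2 (shadowing
		  * in userspace)
		  */
	}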


>
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	/* Invalidate IOTLB when required */
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		// granular information
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>
> 	/* See 5.6 for I/O page fault handling */
> 	
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boot the guest further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, the user should avoid exposing ENQCMD on
> both pdev and mdev.
>
> The sequence applies to all device types (pdev or mdev), except for one
> additional step to call KVM for an ENQCMD-capable mdev:


My understanding is that ENQCMD is Intel-specific and not a requirement
for having vSVA.


>
> 	/* After boot */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	...
>
>
> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. This is just the high-level flow from the host IOMMU driver
> to the guest IOMMU driver and back).
>
> -   Host IOMMU driver receives a page request with raw fault_data {rid,
>      pasid, addr};
>
> -   Host IOMMU driver identifies the faulting I/O page table according to
>      information registered by IOASID fault handler;
>
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
>      is saved in ioasid_data->fault_data (used for response);
>
> -   IOASID fault handler generates a user fault_data (ioasid, addr), links it
>      to the shared ring buffer and triggers the eventfd to userspace;
>
> -   Upon receiving the event, Qemu needs to find the virtual routing information
>      (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
>      multiple, pick a random one. This should be fine since the purpose is to
>      fix the I/O page table on the guest;
>
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>      carrying the virtual fault data (v_rid, v_pasid, addr);
>
> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>      then sends a page response with virtual completion data (v_rid, v_pasid,
>      response_code) to vIOMMU;
>
> -   Qemu finds the pending fault event, converts virtual completion data
>      into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
>      complete the pending fault;
>
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
>      ioasid_data->fault_data, and then calls iommu api to complete it with
>      {rid, pasid, response_code};
>
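A rough sketch of the userspace side of this flow; since the fault uAPI
itself is TBD, the ring layout, the fault_data fields and the
IOASID_PAGE_RESPONSE ioctl below are placeholders only:

	/* wait on the eventfd registered with the ioasid_fd */
	poll(&(struct pollfd){ .fd = fault_eventfd, .events = POLLIN }, 1, -1);

	/* pull one user fault_data record off the shared ring */
	fault = ring_pop(fault_ring);		// { .ioasid, .addr }

	/* find a device attached to fault.ioasid, build (v_rid, v_pasid)
	  * and inject a virtual I/O page fault through the vIOMMU; once
	  * the guest responds, complete the pending fault:
	  */
	resp = {
		.ioasid		= fault.ioasid;
		.response_code	= SUCCESS;
	};
	ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);	// hypothetical ioctl
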
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> The PASID table is put in the GPA space on some platforms, and thus must be
> updated by the guest. It is treated as another user page table to be bound to
> the IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once
> enabled, requires the guest to invalidate PASID cache for any change on the
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> If such a capability is missing, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> 	/* After boot */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);


Do we need VFIO_DETACH_IOASID?

Thanks


>
> 	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	...
>
> Thanks
> Kevin
>



^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-28 16:23   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 518+ messages in thread
From: Jean-Philippe Brucker @ 2021-05-28 16:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Firstly thanks for writing this up and for your patience. I've not read in
detail the second half yet, will take another look later.

> 1. Terminologies and Concepts
> -----------------------------------------
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).
> 
> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind an user-managed I/O page table to the 
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASID's, implying 
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.
> 
> IOASID nesting can be implemented in two ways: hardware nesting and 
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

Is there an advantage to moving software nesting into the kernel?
We could just have the guest do its usual combined map/unmap on the child
fd.

> 
> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/ioasid context always refers to a 
> physical one or 'pdev' (PF or VF). 
> 
> One I/O address space could be attached to multiple devices. In this case, 
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> 
> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
> 
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying 
> the routing information and registering it to the ioasid driver when calling 
> ioasid attach helper function. It could be RID if the assigned device is 
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
> user might also provide its view of virtual routing information (vPASID) in 
> the attach call, e.g. when multiple user-managed I/O address spaces are 
> attached to the vfio_device. In this case VFIO must figure out whether 
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
> 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.
> 
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 
> 
> Modern devices may support a scalable workload submission interface 
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having 
> PASID saved in the CPU MSR and carried in the instruction payload 
> when sent out to the device. Then a single work queue shared by 
> multiple processes can compose DMAs carrying different PASIDs. 
> 
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability 
> for auto-conversion in the fast path. The user is expected to setup the 
> PASID mapping through KVM uAPI, with information about {vpasid, 
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
> to figure out the actual pPASID given an IOASID.
> 
> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For 
> example, I/O page fault is always reported to userspace per IOASID, 
> although it's physically reported per device (RID+PASID). If there is a 
> need of further relaying this fault into the guest, the user is responsible 
> of identifying the device attached to this IOASID (randomly pick one if 
> multiple attached devices)

We need to report accurate information for faults. If the guest tells
device A to DMA, it shouldn't receive a fault report for device B. This is
important if the guest needs to kill a misbehaving device, or even just
for statistics and debugging. It may also simplify routing the page
response, which has to be fast.

> and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space. 
> 
> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who 
> actually writes the PASID table).

This adds significant complexity for Arm (and AMD). Userspace will now
need to walk the PASID table, serializing against invalidation. At least
the SMMU has caching mode for PASID tables so there is no need to trap,
but I'd rather avoid this. I really don't want to make virtio-iommu
devices walk PASID tables unless absolutely necessary, they need to stay
simple.

> One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. However this way significantly 
> violates the philosophy in this /dev/ioasid proposal.

It does correspond better to the underlying architecture and hardware
implementation, of which userspace is well aware since it has to report
them to the guest and deal with different descriptor formats.

> It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU.

As above, I think it's essential that we carry device information in fault
reports. In addition to the two previous reasons, on these platforms
userspace will route all faults through the same channel (vIOMMU event
queue) regardless of the PASID, so we do not need them split and tracked
by PASID. Given that IOPF will be a hot path we should make sure there is
no unnecessary indirection.

Regarding the invalidation, I think limiting it to IOASID may work but it
does bother me that we can't directly forward all invalidations received
on the vIOMMU: if the guest sends a device-wide invalidation, do we
iterate over all IOASIDs and issue one ioctl for each?  Sure the guest is
probably sending that because of detaching the PASID table, for which the
kernel did perform the invalidation, but we can't just assume that and
ignore the request, there may be a different reason. Iterating is going to
take a lot of time, whereas with the current API we can send a single request
and issue a single command to the IOMMU hardware.

Similarly, if the guest sends an ATC invalidation for a whole device (in
the SMMU, that's an ATC_INV without SSID), we'll have to transform that
into multiple IOTLB invalidations?  We can't just send it on IOASID #0,
because it may not have been created by the guest.

Maybe we could at least have invalidation requests on the parent fd for
this kind of global case?  But I'd much rather avoid the PASID tracking
altogether and keep the existing cache invalidate API, let the pIOMMU
driver decode that stuff.

> This is one design choice to be confirmed with ARM guys.
> 
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device. 
> 
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
> device notation in this interface as aforementioned. But the ioasid driver 
> does implicit check to make sure that devices within an iommu group 
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to 
> the user.
> 
> There was a long debate in previous discussion whether VFIO should keep 
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
> a simplified model where every device bound to VFIO is explicitly listed 
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for 
> understanding the group topology and meeting the implicit group check 
> criteria enforced in /dev/ioasid. The use case examples in this proposal 
> are based on the new model.
> 
> Of course for backward compatibility VFIO still needs to keep the existing 
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
> iommu ops to internal ioasid helper functions.
> 
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.

Yes this isn't just about allocating PASIDs anymore. /dev/iommu or
/dev/ioas would make more sense.

> 
> -   PPC has not been considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.
> 
> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

(Arm also needs this, obtaining the VMID allocated by KVM and writing it to
the SMMU descriptor when installing the PASID table
https://lore.kernel.org/linux-iommu/20210222155338.26132-1-shameerali.kolothum.thodi@huawei.com/)

> 
> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
>     which can be physically isolated in-between through PASID-granular
>     IOMMU protection. Historically people also discussed one usage by 
>     mediating a pdev into a mdev. This usage is not covered here, and is 
>     supposed to be replaced by Max's work which allows overriding various 
>     VFIO operations in vfio-pci driver.
> 
> 2. uAPI Proposal
> ----------------------
[...]

> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *
>   * Output parameters:
>   *	- many. TBD.

We probably need a capability format similar to PCI and VFIO.
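
For reference, something along the lines of VFIO's chained capability
header would probably do; a sketch (struct name and fields are
illustrative, modeled on struct vfio_info_cap_header):

	struct ioasid_info_cap_header {
		__u16	id;		/* capability ID */
		__u16	version;
		__u32	next;		/* offset of next capability, 0 if last */
	};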

>   */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
[...]

> 2.2. /dev/vfio uAPI
> ++++++++++++++++
> 
> /*
>   * Bind a vfio_device to the specified IOASID fd
>   *
>   * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
>   * vfio device should not be bound to multiple ioasid_fd's. 
>   *
>   * Input parameters:
>   *	- ioasid_fd;

How about adding a 32-bit "virtual RID" at this point, that the kernel can
provide to userspace during fault reporting?
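
For illustration, the bind parameters could carry a user-chosen label that
the kernel simply echoes back in fault reports; the field name below is
hypothetical:

	/* illustrative bind parameters with a user-provided device label */
	bind = {
		.ioasid_fd	= ioasid_fd;
		.virtual_rid	= 0x42;	// echoed back to userspace in fault reports
	};
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &bind);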

Thanks,
Jean

>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-28 17:35   ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 17:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> IOASID nesting can be implemented in two ways: hardware nesting and 
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

Why? A SW emulation could do this synchronization during invalidation
processing if invalidation contained an IOVA range.

I think this document would be stronger if it included some "Rationale"
statements in key places

> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I wonder if we should just adopt the ARM naming as the API
standard. It is general and doesn't have the SVA connotation that
"Process Address Space ID" carries.
 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.

Unless there is some internal kernel design reason to block it, I
wouldn't go out of my way to prevent it.

> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 

vPASID and related topics seem like they need other IOMMU vendors to take a
very careful look. I'm really glad to see this starting to be spelled
out in such a clear way, as it was hard to see from the patches there
is vendor variation.

> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). 

I agree with Jean-Philippe - at the very least erasing this
information needs a major rationale - but I don't really see why it
must be erased? The HW reports the originating device, is it just a
matter of labeling the devices attached to the /dev/ioasid FD so it
can be reported to userspace?

> multiple attached devices) and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

This seems OK though, I can't think of a reason to allow an IOASID to
be left partially invalidated???
 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space. 
> 
> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. 

> However this way significantly 
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU. This is one design choice to be confirmed with ARM guys.

I'm confused by this rational.

For a vIOMMU that has IO page tables in the guest the basic
choices are:
 - Do we have a hypervisor trap to bind the page table or not? (RID
   and PASID may differ here)
 - Do we have a hypervisor trap to invalidate the page tables or not?

If the first is a hypervisor trap then I agree it makes sense to create a
child IOASID that points to each guest page table and manage it
directly. This should not require walking guest page tables as it is
really just informing the HW where the page table lives. HW will walk
them.

If there are no hypervisor traps (does this exist?) then there is no
way to involve the hypervisor here and the child IOASID should simply
be a pointer to the guest's data structure that describes binding. In
this case that IOASID should claim all PASIDs when bound to a
RID. 

Invalidation should be passed up to the IOMMU driver in terms of
the guest tables information, and either the HW or software has to walk
the guest tables to make sense of it.

Events from the IOMMU to userspace should be tagged with the attached
device label and the PASID/substream ID. This means there is no issue
to have an 'all PASID' IOASID.
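
i.e. something like the following per-event record (the struct and field
names are illustrative only, not proposed uAPI):

	struct ioasid_fault_event {
		__u32	ioasid;
		__u32	device_label;	/* label registered at VFIO bind/attach time */
		__u32	pasid;		/* PASID / substream ID, if any */
		__u64	addr;		/* faulting I/O address */
	};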

> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.

+1 on Jean-Philippe's remarks

> -   PPC has not been considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.

From what I understood PPC is not so bad, Nesting IOASID's did its
preload feature and it needed a way to specify/query the IOVA range an
IOASID will cover.

> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

Ugh, I always stop looking when I reach that boundary. Can anyone
summarize what is going on there?

Most likely passing the /dev/ioasid into KVM's FD (or vice versa) is the
right answer. Eg if ARM needs to get the VMID from KVM and set it in the
ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
reasonable. Certainly better than the symbol_get stuff we have right
now.
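
Roughly like the following; the ioctl name and payload are purely
illustrative, nothing like this exists today:

	/* hypothetical: hand the ioasid_fd to KVM so it can push its VMID */
	vmid_data = { .ioasid_fd = ioasid_fd };
	ioctl(kvm_fd, KVM_SET_IOASID_VMID, &vmid_data);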

I will read through the detail below in another email

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-28 19:58   ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 19:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> 5. Use Cases and Flows
> 
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
> 
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> 
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
> 
> 	ioasid_fd = open("/dev/ioasid", mode);
> 
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.

For others, I don't think this is *strictly* necessary; we can
probably still get to the device_fd using the group_fd and fit in
/dev/ioasid. It does make the rest of this more readable though.


> Three types of IOASIDs are considered:
> 
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
> 
> At least one gpa_ioasid must always be created per guest, while the other 
> two are relevant as far as vIOMMU is concerned.
> 
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the 
> associated routing information in the attaching operation.
> 
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
> 
> 5.1. A simple example
> ++++++++++++++++++
> 
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
> 
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> If the guest is assigned more than dev1, the user follows the above sequence
> to attach other devices to the same gpa_ioasid, i.e. sharing the GPA
> address space across all assigned devices.

eg

 	device2_fd = open("/dev/vfio/devices/dev2", mode);
 	ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
 	ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);

Right?

> 
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
> 
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates 
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
> pass-through mode (gpa_ioasid).
> 
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
> 
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
> 
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> 
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 	/* After boot, guest enables a GIOVA space for dev2 */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> 
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */

Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
IOMMU?

> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA

eg HVA came from reading the guest's page tables and finding it wanted
GPA 0x1000 mapped to IOVA 0x2000?
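
i.e. roughly this on the QEMU side (helper names are hypothetical):

	/* walk the vIOMMU page table to learn GIOVA 0x2000 -> GPA 0x1000,
	 * then convert GPA to HVA via the guest memory map before issuing
	 * the IOASID_DMA_MAP above
	 */
	gpa = viommu_translate(0x2000);		// GIOVA -> GPA 0x1000
	hva = gpa_to_hva(gpa);			// GPA -> HVA 0x40001000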


> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with software-based IOASID nesting 
> available. In this mode it is the kernel, instead of the user, that
> creates the shadow mapping.
> 
> The flow before the guest boots is the same as 5.2, except for one point. Because 
> giova_ioasid is nested on gpa_ioasid, locked accounting is only 
> conducted for gpa_ioasid. So it's not necessary to pre-register virtual 
> memory.
> 
> To save space we only list the steps after boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
> 
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

And in this version the kernel reaches into the parent IOASID's page
tables to translate 0x1000 to 0x40001000 to a physical page? So we
basically remove the qemu process address space entirely from this
translation. It does seem convenient.

> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to 
> bind the guest IOVA page table with the IOMMU:
> 
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

I really think you need to use consistent language. Things that
allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
alloc/create/bind is too confusing.
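
e.g. something along these lines (purely illustrative, just to show the
naming scheme):

	/* kernel-managed page table, no parent */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC_IOASID, &alloc_data);

	/* user-managed page table nested on gpa_ioasid */
	alloc_data = {
		.parent	= gpa_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC_IOASID_PGTABLE,
				&alloc_data);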

> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
> 
> After boot the guest further creates a GVA address space (gpasid1) on 
> dev1. Dev2 is not affected (still attached to giova_ioasid).
> 
> As explained in section 4, the user should avoid exposing ENQCMD on
> both pdev and mdev.
> 
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:
> 
> 	/* After boots */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

Still a little unsure why the vPASID is here rather than on the gva_ioasid.
Is there any scenario where we want different vPASIDs for the same
IOASID? I guess it is OK like this. Hum.

> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

Make sense

> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Again I do wonder if this should just be part of alloc_ioasid. Is
there any reason to split these things? The only advantage to the
split is the device is known, but the device shouldn't impact
anything.

> 5.6. I/O page fault
> +++++++++++++++
> 
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid, 
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates a user fault_data (ioasid, addr), links it 
>     to the shared ring buffer and triggers eventfd to userspace;

Here the rid should be translated to a labeled device, returning the
device label assigned at VFIO_BIND_IOASID_FD. Depending on how the device
was bound, the label might match a rid or a rid,pasid.
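
i.e. the user-visible fault record could look roughly like this (field
names are only a sketch):

	struct ioasid_fault_data {
		__u32	ioasid;		/* faulting I/O address space */
		__u32	device_label;	/* label set at VFIO_BIND_IOASID_FD */
		__u32	pasid;		/* meaningful if the label covers a whole rid */
		__u64	addr;		/* faulting address */
	};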

> -   Upon receiving the event, Qemu needs to find the virtual routing information 
>     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
>     multiple, pick a random one. This should be fine since the purpose is to
>     fix the I/O page table on the guest;

The device label should fix this
 
> -   Qemu finds the pending fault event, converts virtual completion data 
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
>     complete the pending fault;
> 
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};

So resuming a fault on an ioasid will resume all devices pending on
the fault?
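
For reference, the completion side could then be as small as this (the
ioctl name is made up here, mirroring the flow described above):

	resp = {
		.ioasid		= fault.ioasid;		/* from the fault record */
		.response_code	= IOASID_PAGE_RESP_SUCCESS;
	};
	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &resp);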

> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
> 
> The PASID table is put in the GPA space on some platforms, thus it must be
> updated by the guest. It is treated as another user page table to be bound with the 
> IOMMU.
> 
> As explained earlier, the user still needs to explicitly bind every user I/O 
> page table to the kernel so the same pgtable binding protocol (bind, cache 
> invalidate and fault handling) is unified across platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> enabled, requires the guest to invalidate PASID cache for any change on the 
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
> 
> 	/* After boots */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> 
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

I still don't quite get the benefit from doing this.

The idea to create an all-PASID IOASID seems to work better with less
fuss on HW that is directly parsing the guest's PASID table.

Cache invalidate seems easy enough to support

Fault handling needs to return the (ioasid, device_label, pasid) when
working with this kind of ioasid.

It is true that it does create an additional flow qemu has to
implement, but it does directly mirror the HW.

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-28 20:03   ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 20:03 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 

It is very long, but I think this has turned out quite well. It
certainly matches the basic sketch I had in my head when we were
talking about how to create vDPA devices a few years ago.

When you get down to the operations they all seem pretty common sense
and straightforward. Create an IOASID. Connect to a device. Fill the
IOASID with pages somehow. Worry about PASID labeling.

It really is critical to get all the vendor IOMMU people to go over it
and see how their HW features map into this.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 16:23   ` Jean-Philippe Brucker
@ 2021-05-28 20:16     ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 20:16 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

On Fri, May 28, 2021 at 06:23:07PM +0200, Jean-Philippe Brucker wrote:

> Regarding the invalidation, I think limiting it to IOASID may work but it
> does bother me that we can't directly forward all invalidations received
> on the vIOMMU: if the guest sends a device-wide invalidation, do we
> iterate over all IOASIDs and issue one ioctl for each?  Sure the guest is
> probably sending that because of detaching the PASID table, for which the
> kernel did perform the invalidation, but we can't just assume that and
> ignore the request, there may be a different reason. Iterating is going to
> take a lot of time, whereas with the current API we can send a single request
> and issue a single command to the IOMMU hardware.

I think the invalidation could stand some improvement, but that also
feels basically incremental to the essence of the proposal.

I agree with the general goal that the uAPI should be able to issue
invalidates that directly map to HW invalidations.

> Similarly, if the guest sends an ATC invalidation for a whole device (in
> the SMMU, that's an ATC_INV without SSID), we'll have to transform that
> into multiple IOTLB invalidations?  We can't just send it on IOASID #0,
> because it may not have been created by the guest.

For instance, adding device labels allows an invalidate-device
operation to exist, and the "generic" kernel driver can iterate over
all IOASIDs hooked to the device. This would be overridable by the IOMMU driver.
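
Roughly (names made up):

	/* invalidate everything reachable from one labeled device; the
	 * kernel iterates the IOASIDs attached to that label
	 */
	inv_data = {
		.device_label	= dev_label;
		.flags		= IOASID_INV_ALL;
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_DEVICE, &inv_data);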

> > Notes:
> > -   It might be confusing as IOASID is also used in the kernel (drivers/
> >     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> >     find a better name later to differentiate.
> 
> Yes this isn't just about allocating PASIDs anymore. /dev/iommu or
> /dev/ioas would make more sense.

Either makes sense to me

/dev/iommu, with the internal IOASID objects called IOAS (==
iommu_domain), is not bad

> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >   *	- vendor pgtable formats (pgtable binding);
> >   *	- number of child IOASIDs (nesting);
> >   *	- ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.
> >   *
> >   * Input parameters:
> >   *	- u32 ioasid;
> >   *
> >   * Output parameters:
> >   *	- many. TBD.
> 
> We probably need a capability format similar to PCI and VFIO.

Designing this kind of uAPI where it is half HW and half generic is
really tricky to get right. Probably best to take the detailed design
of the IOCTL structs later.
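
If we follow the VFIO style it would presumably be a chained capability
header along these lines (a sketch, modeled on struct vfio_info_cap_header):

	struct ioasid_info_cap_header {
		__u16	id;
		__u16	version;
		__u32	next;	/* offset of the next capability, 0 if last */
	};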

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28  2:24   ` Jason Wang
@ 2021-05-28 20:25     ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 20:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On Fri, May 28, 2021 at 10:24:56AM +0800, Jason Wang wrote:
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver
> 
> Need to explain what "ioasid driver" means.

I think it means "drivers/iommu"

> And if yes, does it allow the device for software specific implementation:
> 
> 1) swiotlb or

I think it is necessary to have a 'software page table' which is
required to do all the mdevs we have today.

> 2) device specific IOASID implementation

"drivers/iommu" is pluggable, so I guess it can exist? I've never seen
it done before though

Whether we'd want this to drive an on-device translation table is an
interesting question. I don't have an answer.

> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure.
> 
> I'm not sure this is true for all archs.

It must be true. For security reasons access to a PASID must be
limited by RID.

RID_A assigned to guest A should not be able to access a PASID being
used by RID_B in guest B. Only a per-RID restriction can accomplish
this.

> I would like to know the reason for such indirection.
> 
> It looks to me the ioasid fd is sufficient for performing any operations.
> 
> Such allocation only works if an ioasid fd can have multiple ioasids, which
> seems not to be the case you describe here.

It is the case, read the examples section. One had 3 interrelated
IOASID objects inside the same FD.
 
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> > 
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> > 
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> > 
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> > 
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> > 
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 
> For vDPA, we need something similar. And in the future, vDPA may allow
> multiple ioasid to be attached to a single device. It should work with the
> current design.

What do you imagine multiple IOASID's being used for in VDPA?

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-28 23:36   ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 23:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> 
> /*
>   * Check whether an uAPI extension is supported. 
>   *
>   * This is for FD-level capabilities, such as locked page pre-registration. 
>   * IOASID-level capabilities are reported through IOASID_GET_INFO.
>   *
>   * Return: 0 if not supported, 1 if supported.
>   */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)

 
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)

So VA ranges are pinned and stored in a tree and later references to
those VA ranges by any other IOASID use the pin cached in the tree?

It seems reasonable and is similar to the ioasid parent/child I
suggested for PPC.

IMHO this should be merged with the all-SW IOASID that is required for
today's mdev drivers. If this can be done while keeping this uAPI then
great, otherwise I don't think it is so bad to weakly nest a physical
IOASID under a SW one just to optimize page pinning.

Either way this seems like a smart direction

> /*
>   * Allocate an IOASID. 
>   *
>   * IOASID is the FD-local software handle representing an I/O address 
>   * space. Each IOASID is associated with a single I/O page table. User 
>   * must call this ioctl to get an IOASID for every I/O address space that is
>   * intended to be enabled in the IOMMU.
>   *
>   * A newly-created IOASID doesn't accept any command before it is 
>   * attached to a device. Once attached, an empty I/O page table is 
>   * bound with the IOMMU then the user could use either DMA mapping 
>   * or pgtable binding commands to manage this I/O page table.

Can the IOASID be populated before being attached?

>   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>   *
>   * Return: allocated ioasid on success, -errno on failure.
>   */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)

I assume alloc will include quite a big structure to satisfy the
various vendor needs?

> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.

It feels wrong to learn most of these attributes of the IOASID only after
attaching to a device.

The user should have some idea how it intends to use the IOASID when
it creates it and the rest of the system should match the intention.

For instance if the user is creating a IOASID to cover the guest GPA
with the intention of making children it should indicate this during
alloc.

If the user is intending to point a child IOASID to a guest page table
in a certain descriptor format then it should indicate it during
alloc.

Device bind should fail if the device somehow isn't compatible with
the scheme the user is trying to use.
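
i.e. something like this at allocation time (the flag and field names are
invented here, just to show the intent being declared up front):

	alloc_data = {
		.flags		= IOASID_ALLOC_PARENT;	/* will carry child IOASIDs */
		.pgtable_format	= IOASID_FMT_KERNEL;	/* kernel-managed map/unmap */
	};
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);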

> /*
>   * Map/unmap process virtual addresses to I/O virtual addresses.
>   *
>   * Provide VFIO type1 equivalent semantics. Start with the same 
>   * restriction e.g. the unmap size should match those used in the 
>   * original mapping call. 
>   *
>   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>   * must be already in the preregistered list.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *	- refer to vfio_iommu_type1_dma_{un}map
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)

What about nested IOASIDs?

> /*
>   * Create a nesting IOASID (child) on an existing IOASID (parent)
>   *
>   * IOASIDs can be nested together, implying that the output address 
>   * from one I/O page table (child) must be further translated by 
>   * another I/O page table (parent).
>   *
>   * As the child adds essentially another reference to the I/O page table 
>   * represented by the parent, any device attached to the child ioasid 
>   * must be already attached to the parent.
>   *
>   * In concept there is no limit on the number of the nesting levels. 
>   * However for the majority case one nesting level is sufficient. The
>   * user should check whether an IOASID supports nesting through 
>   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>   * the nesting capability is reported only on the parent instead of the
>   * child.
>   *
>   * User also needs check (via IOASID_GET_INFO) whether the nesting 
>   * is implemented in hardware or software. If software-based, DMA 
>   * mapping protocol should be used on the child IOASID. Otherwise, 
>   * the child should be operated with pgtable binding protocol.
>   *
>   * Input parameters:
>   *	- u32 parent_ioasid;
>   *
>   * Return: child_ioasid on success, -errno on failure;
>   */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)

Do you think another ioctl is best? Should this just be another
parameter to alloc?

> /*
>   * Bind an user-managed I/O page table with the IOMMU
>   *
>   * Because user page table is untrusted, IOASID nesting must be enabled 
>   * for this ioasid so the kernel can enforce its DMA isolation policy 
>   * through the parent ioasid.
>   *
>   * Pgtable binding protocol is different from DMA mapping. The latter 
>   * has the I/O page table constructed by the kernel and updated 
>   * according to user MAP/UNMAP commands. With pgtable binding the 
>   * whole page table is created and updated by userspace, thus different 
>   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>   *
>   * Because the page table is directly walked by the IOMMU, the user 
>   * must  use a format compatible to the underlying hardware. It can 
>   * check the format information through IOASID_GET_INFO.
>   *
>   * The page table is bound to the IOMMU according to the routing 
>   * information of each attached device under the specified IOASID. The
>   * routing information (RID and optional PASID) is registered when a 
>   * device is attached to this IOASID through VFIO uAPI. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of the user page table;
>   *	- formats (vendor, address_width, etc.);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)

This also feels backwards: why wouldn't we specify this, and the required
page table format, at alloc time?

> /*
>   * Bind an user-managed PASID table to the IOMMU
>   *
>   * This is required for platforms which place PASID table in the GPA space.
>   * In this case the specified IOASID represents the per-RID PASID space.
>   *
>   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>   * special flag to indicate the difference from normal I/O address spaces.
>   *
>   * The format info of the PASID table is reported in IOASID_GET_INFO.
>   *
>   * As explained in the design section, user-managed I/O page tables must
>   * be explicitly bound to the kernel even on these platforms. It allows
>   * the kernel to uniformly manage I/O address spaces across all platforms.
>   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>   * to carry device routing information to indirectly mark the hidden I/O
>   * address spaces.
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of PASID table;
>   *	- formats (vendor, size, etc.);
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)

Ditto

> 
> /*
>   * Invalidate IOTLB for an user-managed I/O page table
>   *
>   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
>   * doesn't allow the user to specify cache type and likely support only
>   * two granularities (all, or a specified range) in the I/O address space.
>   *
>   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
>   * cache). If the IOASID represents an I/O address space, the invalidation
>   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>   * represents a vPASID space, then this command applies to the PASID
>   * cache.
>   *
>   * Similarly this command doesn't provide IOMMU-like granularity
>   * info (domain-wide, pasid-wide, range-based), since it's all about the
>   * I/O address space itself. The ioasid driver walks the attached
>   * routing information to match the IOMMU semantics under the
>   * hood. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- granularity
>   * 
>   * Return: 0 on success, -errno on failure
>   */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)

This should have an IOVA range too?

> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */

Any reason not to just use read()?
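
i.e. (sketch only, reusing the fault record layout suggested earlier):

	struct ioasid_fault_data ev;

	while (read(ioasid_fd, &ev, sizeof(ev)) == sizeof(ev))
		handle_fault(ev.ioasid, ev.addr);	/* userspace handler */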
  
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++

To be clear, you mean the 'struct vfio_device' API; these are not
IOCTLs on the container or group?

> /*
>    * Bind a vfio_device to the specified IOASID fd
>    *
>    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
>    * vfio device should not be bound to multiple ioasid_fd's.
>    *
>    * Input parameters:
>    *  - ioasid_fd;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)

This is where it would make sense to have an output "device id" that
allows /dev/ioasid to refer to this "device" by number in events and
other related things.
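
e.g. (field names invented):

	bind_data = { .ioasid_fd = ioasid_fd };
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &bind_data);
	dev_label = bind_data.out_device_label;	/* later used in faults
						 * and invalidations */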

> 
> 2.3. KVM uAPI
> ++++++++++++
> 
> /*
>   * Update CPU PASID mapping
>   *
>   * This is necessary when ENQCMD will be used in the guest while the
>   * targeted device doesn't accept the vPASID saved in the CPU MSR.
>   *
>   * This command allows user to set/clear the vPASID->pPASID mapping
>   * in the CPU, by providing the IOASID (and FD) information representing
>   * the I/O address space marked by this vPASID.
>   *
>   * Input parameters:
>   *	- user_pasid;
>   *	- ioasid_fd;
>   *	- ioasid;
>   */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)

It seems simple enough. So the physical PASID can only be assigned if
the user has an IOASID that points at it? Thus it is secure?
 
> 3. Sample structures and helper functions
> 
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> 
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
> 
> An ioasid_ctx is created for each fd:
> 
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;

Would expect an xarray

> 		// a list of registered devices
> 		struct list_head		dev_list;

xarray of device_id

> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;

Should re-use the existing SW IOASID table, and be an interval tree.

> Each registered device is represented by ioasid_dev:
> 
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device
> 		struct device 		*device;
> 		struct kref		kref;
> 	};
> 
> Because we assume one vfio_device connected to at most one ioasid_fd, 
> here ioasid_dev could be embedded in vfio_device and then linked to 
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. PASID marking this
> mdev is specified later when VFIO_ATTACH_IOASID.

Don't embed a struct like this in something like vfio_device - that
just makes a mess of reference counting by having multiple krefs in
the same memory block. Keep it as a pointer; the attach operation
should return a pointer to the above struct.

> An ioasid_data is created when IOASID_ALLOC, as the main object 
> describing characteristics about an I/O page table:
> 
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
> 
> 		// the IOASID number
> 		u32			ioasid;
> 
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;

But at least for the first coding draft I would expect to see this API
presented with no PASID support and a simple 1:1 with iommu_domain. How
PASID gets modeled is the big TBD, right?

> ioasid_data and iommu_domain have overlapping roles as both are 
> introduced to represent an I/O address space. It is still a big TBD how 
> the two should be corelated or even merged, and whether new iommu 
> ops are required to handle RID+PASID explicitly.

I think it is OK that the uapi and kernel api have different
structs. The uapi focused one should hold the uapi related data, which
is what you've shown here, I think.

> Two helper functions are provided to support VFIO_ATTACH_IOASID:
> 
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;
> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev, 
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

Honestly, I still prefer this to be highly explicit as this is where
all device driver authors get involved:

ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev, u32 ioasid);
ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

And presumably a variant for ARM non-PCI platform (?) devices.

This could boil down to a __ioasid_device_attach() as you've shown.
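
A sketch of how those explicit entry points might collapse onto one
internal helper; every name and signature below is illustrative, not
part of the proposal:

	/* illustrative internal helper; NULL pasid means RID-only routing */
	int __ioasid_device_attach(struct ioasid_dev *dev, u32 ioasid, u32 *pasid);

	int ioasid_pci_device_attach(struct pci_dev *pdev,
				     struct ioasid_dev *dev, u32 ioasid)
	{
		/* pdev supplies the RID routing; no PASID involved */
		return __ioasid_device_attach(dev, ioasid, NULL);
	}

	int ioasid_pci_device_pasid_attach(struct pci_dev *pdev, u32 *physical_pasid,
					   struct ioasid_dev *dev, u32 ioasid)
	{
		/* RID+PASID routing; *physical_pasid is filled in on success */
		return __ioasid_device_attach(dev, ioasid, physical_pasid);
	}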

> A new object is introduced and linked to ioasid_data->attach_data for 
> each successful attach operation:
> 
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}

This should be returned as a pointer and detach should be:

int ioasid_device_detach(struct ioasid_attach_data *);
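
In other words, a sketch of the suggested ownership model (signatures
illustrative):

	/* attach allocates and returns the per-attach routing object ... */
	struct ioasid_attach_data *ioasid_device_attach(struct ioasid_dev *dev,
							struct attach_info info);

	/* ... and detach consumes it, so no (dev, ioasid) lookup is needed */
	int ioasid_device_detach(struct ioasid_attach_data *attach);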
 
> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is 
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.

It is simple enough. Would be good to design in a diagnostic string so
userspace can make sense of the failure, e.g. return something like
-EDEADLK and provide an ioctl 'why did EDEADLK happen'?
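
Purely as a hypothetical shape for such a diagnostic, to make the idea
concrete (nothing below is part of the proposal):

	/* hypothetical: explain the last failed command on this ioasid_fd */
	struct ioasid_error_info {
		__u32	ioasid;		/* IOASID the failed command targeted */
		__u32	errno_val;	/* e.g. EDEADLK */
		char	reason[128];	/* "device 0000:00:02.1 in group 7 not attached" */
	};
	#define IOASID_GET_LAST_ERROR	_IO(IOASID_TYPE, IOASID_BASE + 14)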


> Then is the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
> 		u32 ioasid, bool alloc);
> 
> ioasid_get_global_pasid is necessary in scenarios where multiple devices 
> want to share a same PASID value on the attached I/O page table (e.g. 
> when ENQCMD is enabled, as explained in next section). We need a 
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to setup PASID translation 
> structure when user calls KVM_MAP_PASID.

When/why would the VFIO driver do this? Isn't this just some variant
of pasid_attach?

ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

?

> 4. PASID Virtualization
> 
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
> created on the assigned vfio device. This leads to the concepts of 
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned 
> by the guest to mark an GVA address space while pPASID is the one 
> selected by the host and actually routed in the wire.
> 
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

Should the vPASID be programmed into the IOASID before calling
VFIO_ATTACH_IOASID?

> vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> device, with two factors to be considered:
> 
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
>      should be instead converted to a newly-allocated one (vPASID!=
>      pPASID);
> 
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>      space or a global PASID space (implying sharing pPASID cross devices,
>      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>      as part of the process context);

This whole section 4 is really confusing. I think it would be more
understandable to focus on the list below and minimize the vPASID
discussion.

> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
> 
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
> policies.)

This has become unclear. I think this should start by identifying the
6 main types of devices and how they can use pPASID/vPASID:

0) Device is a RID and cannot issue PASID
1) Device is a mdev and cannot issue PASID
2) Device is a mdev and programs a single fixed PASID during bind,
   does not accept PASID from the guest

3) Device accepts any PASIDs from the guest. No
   vPASID/pPASID translation is possible. (classic vfio_pci)
4) Device accepts any PASID from the guest and has an
   internal vPASID/pPASID translation (enhanced vfio_pci)
5) Device accepts any PASID from the guest and relies on
   external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)

0-2 don't use vPASID at all

3-5 consume a vPASID but handle it differently.

I think 3-5 map into what you are trying to explain in the table
below, which is the rules for allocating the vPASID depending on which
of device types 3-5 are present and/or mixed.

For instance device type 3 requires vPASID == pPASID because it can't
do translation at all.

This probably all needs to come through clearly in the /dev/ioasid
interface. Once the attached devices are labeled it would make sense to
have a 'query device' /dev/ioasid IOCTL to report the details based on
how the device attached and other information.
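
To make that concrete, a hypothetical 'query device' report could look
like the sketch below, with the mode enum mirroring the 0-5 list above;
all names are invented for illustration:

	enum ioasid_device_pasid_mode {
		IOASID_DEV_NO_PASID = 0,	/* types 0-1: cannot issue PASID */
		IOASID_DEV_FIXED_PASID,		/* type 2: single PASID fixed at bind */
		IOASID_DEV_GUEST_PASID,		/* type 3: vPASID must equal pPASID */
		IOASID_DEV_SELF_TRANSLATE,	/* type 4: device-internal translation */
		IOASID_DEV_ENQCMD_TRANSLATE,	/* type 5: relies on ENQCMD translation */
	};

	struct ioasid_device_info {
		__u32	device_id;	/* the label handed out at bind time */
		__u32	pasid_mode;	/* enum ioasid_device_pasid_mode */
		__u32	ioasid;		/* IOASID this device is attached to */
	};
	#define IOASID_QUERY_DEVICE	_IO(IOASID_TYPE, IOASID_BASE + 15)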

> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> 
>      PASIDs are also used by kernel to mark the default I/O address space 
>      for mdev, thus cannot be delegated to the guest. Instead, the mdev 
>      driver must allocate a new pPASID for each vPASID (thus vPASID!=
>      pPASID) and then use pPASID when attaching this mdev to an ioasid.

I don't understand this at all.. What does "PASIDs are also used by
the kernel" mean?

>      The mdev driver needs cache the PASID mapping so in mediation 
>      path vPASID programmed by the guest can be converted to pPASID 
>      before updating the physical MMIO register.

This is my scenario #4 above. The device internally virtualizes
vPASID/pPASID - how that is done is up to the device. But these are
all just labels; when such a device attaches, it should use some
specific API:

ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
 u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

And then maintain its internal translation
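
For instance, the driver side of that translation could be as small as
the sketch below, assuming an xarray keyed by vPASID; the state struct
and the register name are invented:

	/* illustrative per-mdev state */
	struct my_mdev_state {
		struct xarray	pasid_xa;	/* index: vPASID, value: pPASID */
		void __iomem	*mmio_base;
	};

	/* at attach time, remember the vPASID -> pPASID pairing */
	static int my_mdev_save_pasid(struct my_mdev_state *m, u32 vpasid, u32 ppasid)
	{
		return xa_err(xa_store(&m->pasid_xa, vpasid,
				       xa_mk_value(ppasid), GFP_KERNEL));
	}

	/* in the mediation path, translate before touching the physical register */
	static void my_mdev_write_pasid_reg(struct my_mdev_state *m, u32 vpasid)
	{
		void *entry = xa_load(&m->pasid_xa, vpasid);

		if (entry)
			writel(xa_to_value(entry), m->mmio_base + MY_MDEV_PASID_REG);
	}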

>      In previous thread a PASID range split scheme was discussed to support
>      this combination, but we haven't worked out a clean uAPI design yet.
>      Therefore in this proposal we decide to not support it, implying the 
>      user should have some intelligence to avoid such scenario. It could be
>      a TODO task for future.

It really just boils down to how to allocate the PASIDs to get around
the bad viommu interface that assumes all PASIDs are usable by all
devices.
 
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
> 
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;

Regardless, all this mess needs to be hidden from the consuming drivers
with some simple APIs as above. The driver should indicate what its HW
can do and the PASID #'s that magically come out of /dev/ioasid should
be appropriate.

Will resume on another email..

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28  2:24   ` Jason Wang
  (?)
  (?)
@ 2021-05-31  8:41   ` Liu Yi L
  2021-06-01  2:36       ` Jason Wang
  -1 siblings, 1 reply; 518+ messages in thread
From: Liu Yi L @ 2021-05-31  8:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Raj, Ashok <ashok.raj@intel.com>, kvm,
	Alex Williamson (alex.williamson@redhat.com)
	<alex.williamson@redhat.com>,
	Eric Auger <eric.auger@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>, LKML,
	Jiang, Dave <dave.jiang@intel.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	David Gibson <david@gibson.dropbear.id.au>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Robin Murphy <robin.murphy@arm.com>,
	Wu, Hao <hao.wu@intel.com>,
	iommu, Jason Gunthorpe, David Woodhouse

On Fri, 28 May 2021 10:24:56 +0800, Jason Wang wrote:

> 在 2021/5/27 下午3:58, Tian, Kevin 写道:
> > /dev/ioasid provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.  
> 
> 
> Not a native speaker but /dev/ioas seems better?

We are open to it. Using /dev/ioasid is just because we have been using
it in the previous discussion. ^_^

> 
> >
> > This proposal describes the uAPI of /dev/ioasid and also sample sequences
> > with VFIO as example in typical usages. The driver-facing kernel API provided
> > by the iommu layer is still TBD, which can be discussed after consensus is
> > made on this uAPI.
> >
> > It's based on a lengthy discussion starting from here:
> > 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
> >
> > It ends up to be a long writing due to many things to be summarized and
> > non-trivial effort required to connect them into a complete proposal.
> > Hope it provides a clean base to converge.
> >
> > TOC
> > ====
> > 1. Terminologies and Concepts
> > 2. uAPI Proposal
> >      2.1. /dev/ioasid uAPI
> >      2.2. /dev/vfio uAPI
> >      2.3. /dev/kvm uAPI
> > 3. Sample structures and helper functions
> > 4. PASID virtualization
> > 5. Use Cases and Flows
> >      5.1. A simple example
> >      5.2. Multiple IOASIDs (no nesting)
> >      5.3. IOASID nesting (software)
> >      5.4. IOASID nesting (hardware)
> >      5.5. Guest SVA (vSVA)
> >      5.6. I/O page fault
> >      5.7. BIND_PASID_TABLE
> > ====
> >
[...]
> >
> > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > It doesn't include any device routing information, which is only
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). If there is a
> > need of further relaying this fault into the guest, the user is responsible
> > of identifying the device attached to this IOASID (randomly pick one if
> > multiple attached devices) and then generates a per-device virtual I/O
> > page fault into guest. Similarly the iotlb invalidation uAPI describes the
> > granularity in the I/O address space (all, or a range), different from the
> > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> >
> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure.  
> 
> 
> I'm not sure this is true for all archs.

Today, yes, and I echo JasonG's comment on it.

> 
> >   Some platforms implement the PASID table in the guest
> > physical space (GPA), expecting it managed by the guest. The guest
> > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > representing the per-RID vPASID space.
> >
[...]
> >
> > /*
> >    * Get information about an I/O address space
> >    *
> >    * Supported capabilities:
> >    *	- VFIO type1 map/unmap;
> >    *	- pgtable/pasid_table binding
> >    *	- hardware nesting vs. software nesting;
> >    *	- ...
> >    *
> >    * Related attributes:
> >    * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >    *	- vendor pgtable formats (pgtable binding);
> >    *	- number of child IOASIDs (nesting);
> >    *	- ...
> >    *
> >    * Above information is available only after one or more devices are
> >    * attached to the specified IOASID. Otherwise the IOASID is just a
> >    * number w/o any capability or attribute.
> >    *
> >    * Input parameters:
> >    *	- u32 ioasid;
> >    *
> >    * Output parameters:
> >    *	- many. TBD.
> >    */
> > #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
> >
> >
> > /*
> >    * Map/unmap process virtual addresses to I/O virtual addresses.
> >    *
> >    * Provide VFIO type1 equivalent semantics. Start with the same
> >    * restriction e.g. the unmap size should match those used in the
> >    * original mapping call.
> >    *
> >    * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> >    * must be already in the preregistered list.
> >    *
> >    * Input parameters:
> >    *	- u32 ioasid;
> >    *	- refer to vfio_iommu_type1_dma_{un}map
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> >
> > /*
> >    * Create a nesting IOASID (child) on an existing IOASID (parent)
> >    *
> >    * IOASIDs can be nested together, implying that the output address
> >    * from one I/O page table (child) must be further translated by
> >    * another I/O page table (parent).
> >    *
> >    * As the child adds essentially another reference to the I/O page table
> >    * represented by the parent, any device attached to the child ioasid
> >    * must be already attached to the parent.
> >    *
> >    * In concept there is no limit on the number of the nesting levels.
> >    * However for the majority case one nesting level is sufficient. The
> >    * user should check whether an IOASID supports nesting through
> >    * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> >    * the nesting capability is reported only on the parent instead of the
> >    * child.
> >    *
> >    * User also needs check (via IOASID_GET_INFO) whether the nesting
> >    * is implemented in hardware or software. If software-based, DMA
> >    * mapping protocol should be used on the child IOASID. Otherwise,
> >    * the child should be operated with pgtable binding protocol.
> >    *
> >    * Input parameters:
> >    *	- u32 parent_ioasid;
> >    *
> >    * Return: child_ioasid on success, -errno on failure;
> >    */
> > #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
> >
> >
> > /*
> >    * Bind an user-managed I/O page table with the IOMMU
> >    *
> >    * Because user page table is untrusted, IOASID nesting must be enabled
> >    * for this ioasid so the kernel can enforce its DMA isolation policy
> >    * through the parent ioasid.
> >    *
> >    * Pgtable binding protocol is different from DMA mapping. The latter
> >    * has the I/O page table constructed by the kernel and updated
> >    * according to user MAP/UNMAP commands. With pgtable binding the
> >    * whole page table is created and updated by userspace, thus different
> >    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> >    *
> >    * Because the page table is directly walked by the IOMMU, the user
> >    * must  use a format compatible to the underlying hardware. It can
> >    * check the format information through IOASID_GET_INFO.
> >    *
> >    * The page table is bound to the IOMMU according to the routing
> >    * information of each attached device under the specified IOASID. The
> >    * routing information (RID and optional PASID) is registered when a
> >    * device is attached to this IOASID through VFIO uAPI.
> >    *
> >    * Input parameters:
> >    *	- child_ioasid;
> >    *	- address of the user page table;
> >    *	- formats (vendor, address_width, etc.);
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> >
> >
> > /*
> >    * Bind an user-managed PASID table to the IOMMU
> >    *
> >    * This is required for platforms which place PASID table in the GPA space.
> >    * In this case the specified IOASID represents the per-RID PASID space.
> >    *
> >    * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> >    * special flag to indicate the difference from normal I/O address spaces.
> >    *
> >    * The format info of the PASID table is reported in IOASID_GET_INFO.
> >    *
> >    * As explained in the design section, user-managed I/O page tables must
> >    * be explicitly bound to the kernel even on these platforms. It allows
> >    * the kernel to uniformly manage I/O address spaces cross all platforms.
> >    * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> >    * to carry device routing information to indirectly mark the hidden I/O
> >    * address spaces.
> >    *
> >    * Input parameters:
> >    *	- child_ioasid;
> >    *	- address of PASID table;
> >    *	- formats (vendor, size, etc.);
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> > #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
> >
> >
> > /*
> >    * Invalidate IOTLB for an user-managed I/O page table
> >    *
> >    * Unlike what's defined in include/uapi/linux/iommu.h, this command
> >    * doesn't allow the user to specify cache type and likely support only
> >    * two granularities (all, or a specified range) in the I/O address space.
> >    *
> >    * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> >    * cache). If the IOASID represents an I/O address space, the invalidation
> >    * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> >    * represents a vPASID space, then this command applies to the PASID
> >    * cache.
> >    *
> >    * Similarly this command doesn't provide IOMMU-like granularity
> >    * info (domain-wide, pasid-wide, range-based), since it's all about the
> >    * I/O address space itself. The ioasid driver walks the attached
> >    * routing information to match the IOMMU semantics under the
> >    * hood.
> >    *
> >    * Input parameters:
> >    *	- child_ioasid;
> >    *	- granularity
> >    *
> >    * Return: 0 on success, -errno on failure
> >    */
> > #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
> >
> >
> > /*
> >    * Page fault report and response
> >    *
> >    * This is TBD. Can be added after other parts are cleared up. Likely it
> >    * will be a ring buffer shared between user/kernel, an eventfd to notify
> >    * the user and an ioctl to complete the fault.
> >    *
> >    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> >    */
> >
> >
> > /*
> >    * Dirty page tracking
> >    *
> >    * Track and report memory pages dirtied in I/O address spaces. There
> >    * is an ongoing work by Kunkun Jiang by extending existing VFIO type1.
> >    * It needs be adapted to /dev/ioasid later.
> >    */
> >
> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
> >
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *	- ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
> >
> >
> > /*
> >    * Attach a vfio device to the specified IOASID
> >    *
> >    * Multiple vfio devices can be attached to the same IOASID, and vice
> >    * versa.
> >    *
> >    * User may optionally provide a "virtual PASID" to mark an I/O page
> >    * table on this vfio device. Whether the virtual PASID is physically used
> >    * or converted to another kernel-allocated PASID is a policy in vfio device
> >    * driver.
> >    *
> >    * There is no need to specify ioasid_fd in this call due to the assumption
> >    * of 1:1 connection between vfio device and the bound fd.
> >    *
> >    * Input parameter:
> >    *	- ioasid;
> >    *	- flag;
> >    *	- user_pasid (if specified);
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> > #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
> >
> >
> > 2.3. KVM uAPI
> > ++++++++++++
> >
> > /*
> >    * Update CPU PASID mapping
> >    *
> >    * This is necessary when ENQCMD will be used in the guest while the
> >    * targeted device doesn't accept the vPASID saved in the CPU MSR.
> >    *
> >    * This command allows user to set/clear the vPASID->pPASID mapping
> >    * in the CPU, by providing the IOASID (and FD) information representing
> >    * the I/O address space marked by this vPASID.
> >    *
> >    * Input parameters:
> >    *	- user_pasid;
> >    *	- ioasid_fd;
> >    *	- ioasid;
> >    */
> > #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> > #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
> >
> >
> > 3. Sample structures and helper functions
> > --------------------------------------------------------
> >
> > Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> >
> > 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> > 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> > 	int ioasid_unregister_device(struct ioasid_dev *dev);
> >
> > An ioasid_ctx is created for each fd:
> >
> > 	struct ioasid_ctx {
> > 		// a list of allocated IOASID data's
> > 		struct list_head		ioasid_list;
> > 		// a list of registered devices
> > 		struct list_head		dev_list;
> > 		// a list of pre-registered virtual address ranges
> > 		struct list_head		prereg_list;
> > 	};
> >
> > Each registered device is represented by ioasid_dev:
> >
> > 	struct ioasid_dev {
> > 		struct list_head		next;
> > 		struct ioasid_ctx	*ctx;
> > 		// always be the physical device
> > 		struct device 		*device;
> > 		struct kref		kref;
> > 	};
> >
> > Because we assume one vfio_device connected to at most one ioasid_fd,
> > here ioasid_dev could be embedded in vfio_device and then linked to
> > ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> > device should be the pointer to the parent device. PASID marking this
> > mdev is specified later when VFIO_ATTACH_IOASID.
> >
> > An ioasid_data is created when IOASID_ALLOC, as the main object
> > describing characteristics about an I/O page table:
> >
> > 	struct ioasid_data {
> > 		// link to ioasid_ctx->ioasid_list
> > 		struct list_head		next;
> >
> > 		// the IOASID number
> > 		u32			ioasid;
> >
> > 		// the handle to convey iommu operations
> > 		// hold the pgd (TBD until discussing iommu api)
> > 		struct iommu_domain *domain;
> >
> > 		// map metadata (vfio type1 semantics)
> > 		struct rb_node		dma_list;
> >
> > 		// pointer to user-managed pgtable (for nesting case)
> > 		u64			user_pgd;
> >
> > 		// link to the parent ioasid (for nesting)
> > 		struct ioasid_data	*parent;
> >
> > 		// cache the global PASID shared by ENQCMD-capable
> > 		// devices (see below explanation in section 4)
> > 		u32			pasid;
> >
> > 		// a list of device attach data (routing information)
> > 		struct list_head		attach_data;
> >
> > 		// a list of partially-attached devices (group)
> > 		struct list_head		partial_devices;
> >
> > 		// a list of fault_data reported from the iommu layer
> > 		struct list_head		fault_data;
> >
> > 		...
> > 	}
> >
> > ioasid_data and iommu_domain have overlapping roles as both are
> > introduced to represent an I/O address space. It is still a big TBD how
> > the two should be corelated or even merged, and whether new iommu
> > ops are required to handle RID+PASID explicitly. We leave this as open
> > for now as this proposal is mainly about uAPI. For simplification
> > purpose the two objects are kept separate in this context, assuming an
> > 1:1 connection in-between and the domain as the place-holder
> > representing the 1st class object in the iommu ops.
> >
> > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> >
> > 	struct attach_info {
> > 		u32	ioasid;
> > 		// If valid, the PASID to be used physically
> > 		u32	pasid;
> > 	};
> > 	int ioasid_device_attach(struct ioasid_dev *dev,
> > 		struct attach_info info);
> > 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> >
> > The pasid parameter is optionally provided based on the policy in vfio
> > device driver. It could be the PASID marking the default I/O address
> > space for a mdev, or the user-provided PASID marking an user I/O page
> > table, or another kernel-allocated PASID backing the user-provided one.
> > Please check next section for detail explanation.
> >
> > A new object is introduced and linked to ioasid_data->attach_data for
> > each successful attach operation:
> >
> > 	struct ioasid_attach_data {
> > 		struct list_head		next;
> > 		struct ioasid_dev	*dev;
> > 		u32 			pasid;
> > 	}
> >
> > As explained in the design section, there is no explicit group enforcement
> > in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> > implicit group check - before every device within an iommu group is
> > attached to this IOASID, the previously-attached devices in this group are
> > put in ioasid_data->partial_devices. The IOASID rejects any command if
> > the partial_devices list is not empty.
> >
> > Then is the last helper function:
> > 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> > 		u32 ioasid, bool alloc);
> >
> > ioasid_get_global_pasid is necessary in scenarios where multiple devices
> > want to share a same PASID value on the attached I/O page table (e.g.
> > when ENQCMD is enabled, as explained in next section). We need a
> > centralized place (ioasid_data->pasid) to hold this value (allocated when
> > first called with alloc=true). vfio device driver calls this function (alloc=
> > true) to get the global PASID for an ioasid before calling ioasid_device_
> > attach. KVM also calls this function (alloc=false) to setup PASID translation
> > structure when user calls KVM_MAP_PASID.
> >
> > 4. PASID Virtualization
> > ------------------------------
> >
> > When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> > created on the assigned vfio device. This leads to the concepts of
> > "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> > by the guest to mark an GVA address space while pPASID is the one
> > selected by the host and actually routed in the wire.
> >
> > vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
> >
> > vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> > device, with two factors to be considered:
> >
> > -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> >       should be instead converted to a newly-allocated one (vPASID!=
> >       pPASID);
> >
> > -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> >       space or a global PASID space (implying sharing pPASID cross devices,
> >       e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> >       as part of the process context);
> >
> > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > supported. There are three possible scenarios:
> >
> > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > policies.)
> >
> > 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
> >
> >       vPASIDs are directly programmed by the guest to the assigned MMIO
> >       bar, implying all DMAs out of this device having vPASID in the packet
> >       header. This mandates vPASID==pPASID, sort of delegating the entire
> >       per-RID PASID space to the guest.
> >
> >       When ENQCMD is enabled, the CPU MSR when running a guest task
> >       contains a vPASID. In this case the CPU PASID translation capability
> >       should be disabled so this vPASID in CPU MSR is directly sent to the
> >       wire.
> >
> >       This ensures consistent vPASID usage on pdev regardless of the
> >       workload submitted through a MMIO register or ENQCMD instruction.
> >
> > 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> >
> >       PASIDs are also used by kernel to mark the default I/O address space
> >       for mdev, thus cannot be delegated to the guest. Instead, the mdev
> >       driver must allocate a new pPASID for each vPASID (thus vPASID!=
> >       pPASID) and then use pPASID when attaching this mdev to an ioasid.
> >
> >       The mdev driver needs cache the PASID mapping so in mediation
> >       path vPASID programmed by the guest can be converted to pPASID
> >       before updating the physical MMIO register. The mapping should
> >       also be saved in the CPU PASID translation structure (via KVM uAPI),
> >       so the vPASID saved in the CPU MSR is auto-translated to pPASID
> >       before sent to the wire, when ENQCMD is enabled.
> >
> >       Generally pPASID could be allocated from the per-RID PASID space
> >       if all mdev's created on the parent device don't support ENQCMD.
> >
> >       However if the parent supports ENQCMD-capable mdev, pPASIDs
> >       must be allocated from a global pool because the CPU PASID
> >       translation structure is per-VM. It implies that when an guest I/O
> >       page table is attached to two mdevs with a single vPASID (i.e. bind
> >       to the same guest process), a same pPASID should be used for
> >       both mdevs even when they belong to different parents. Sharing
> >       pPASID cross mdevs is achieved by calling aforementioned ioasid_
> >       get_global_pasid().
> >
> > 3)  Mix pdev/mdev together
> >
> >       Above policies are per device type thus are not affected when mixing
> >       those device types together (when assigned to a single guest). However,
> >       there is one exception - when both pdev/mdev support ENQCMD.
> >
> >       Remember the two types have conflicting requirements on whether
> >       CPU PASID translation should be enabled. This capability is per-VM,
> >       and must be enabled for mdev isolation. When enabled, pdev will
> >       receive a mdev pPASID violating its vPASID expectation.
> >
> >       In previous thread a PASID range split scheme was discussed to support
> >       this combination, but we haven't worked out a clean uAPI design yet.
> >       Therefore in this proposal we decide to not support it, implying the
> >       user should have some intelligence to avoid such scenario. It could be
> >       a TODO task for future.
> >
> > In spite of those subtle considerations, the kernel implementation could
> > start simple, e.g.:
> >
> > -    v==p for pdev;
> > -    v!=p and always use a global PASID pool for all mdev's;
> >
> > Regardless of the kernel policy, the user policy is unchanged:
> >
> > -    provide vPASID when calling VFIO_ATTACH_IOASID;
> > -    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> > -    Don't expose ENQCMD capability on both pdev and mdev;
> >
> > Sample user flow is described in section 5.5.
> >
> > 5. Use Cases and Flows
> > -------------------------------
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > 	ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> >
> > Three types of IOASIDs are considered:
> >
> > 	gpa_ioasid[1...N]: 	for GPA address space
> > 	giova_ioasid[1...N]:	for guest IOVA address space
> > 	gva_ioasid[1...N]:	for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > 	/* Bind device to IOASID fd */
> > 	device_fd = open("/dev/vfio/devices/dev1", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* Attach device to IOASID */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0;		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu need to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* pre-register the virtual address range for accounting */
> > 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> > 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > 	/* Attach dev1 and dev2 to gpa_ioasid */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0; 		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > 	/* After boot, guest enables an GIOVA space for dev2 */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> > 	/* First detach dev2 from previous address space */
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> > 	/* Then attach dev2 to the new address space */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a shadow DMA mapping according to vIOMMU
> > 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> > 	  */
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000; 	// GIOVA
> > 		.vaddr	= 0x40001000;	// HVA
> > 		.size	= 4KB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);  
> 
> 
> For vDPA, we need something similar. And in the future, vDPA may allow 
> multiple ioasid to be attached to a single device. It should work with 
> the current design.
> 
> 
> >
> > 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> > 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> > 	  * to form a shadow mapping.
> > 	  */
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000;	// GIOVA
> > 		.vaddr	= 0x1000;	// GPA
> > 		.size	= 4KB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);  
> 
> 
> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support 
> hardware nesting. Or is there way to detect the capability before?

I think it could fail at IOASID_CREATE_NESTING. If the gpa_ioasid is not
able to support nesting, that call should fail.

> I think GET_INFO only works after the ATTACH.

Yes. After attaching to gpa_ioasid, userspace could call GET_INFO on the
gpa_ioasid and check whether nesting is supported. Right?
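
For example, the check could look roughly like below (just a sketch; the
flag name is made up and not part of the proposal):

	/* after attaching dev2 to gpa_ioasid */
	info = { .ioasid = gpa_ioasid };
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	if (info.flags & IOASID_INFO_CAP_HW_NESTING)
		giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
					gpa_ioasid);
	else
		/* fall back to software nesting or userspace shadowing */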

> 
> 
> >
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= giova_ioasid;
> > 		.addr	= giova_pgtable;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> >
> > 	/* Invalidate IOTLB when required */
> > 	inv_data = {
> > 		.ioasid	= giova_ioasid;
> > 		// granular information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
> >
> > 	/* See 5.6 for I/O page fault handling */
> > 	
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boots the guest further create a GVA address spaces (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid expose ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:  
> 
> 
> My understanding is ENQCMD is Intel specific and not a requirement for 
> having vSVA.

ENQCMD is not really Intel specific, although only Intel supports it today.
The PCIe DMWr capability is what software uses to enumerate ENQCMD support
on the device side. Yes, it is not a requirement for vSVA; they are
orthogonal.

> 
> >
> > 	/* After boots */
> > 	/* Make GVA space nested on GPA space */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space and specify vPASID */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > 	  * translation structure through KVM
> > 	  */
> > 	pa_data = {
> > 		.ioasid_fd	= ioasid_fd;
> > 		.ioasid		= gva_ioasid;
> > 		.guest_pasid	= gpasid1;
> > 	};
> > 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> >
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> >
> > 	...

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36   ` Jason Gunthorpe
@ 2021-05-31 11:31     ` Liu Yi L
  -1 siblings, 0 replies; 518+ messages in thread
From: Liu Yi L @ 2021-05-31 11:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: yi.l.liu, Tian, Kevin, Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave, Wu, Hao,
	David Woodhouse, Jason Wang

On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:

> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> > 
> > /*
> >   * Check whether an uAPI extension is supported. 
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration. 
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)  
> 
>  
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   *	- vaddr;
> >   *	- size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)  
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
> 
> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID. 
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address 
> >   * space. Each IOASID is associated with a single I/O page table. User 
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is 
> >   * attached to a device. Once attached, an empty I/O page table is 
> >   * bound with the IOMMU then the user could use either DMA mapping 
> >   * or pgtable binding commands to manage this I/O page table.  
> 
> Can the IOASID can be populated before being attached?

perhaps a MAP/UNMAP operation on a gpa_ioasid?

> 
> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)  
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
>
> 
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >   *	- vendor pgtable formats (pgtable binding);
> >   *	- number of child IOASIDs (nesting);
> >   *	- ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.  
> 
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

but an IOASID is just a software handle before it is attached to a specific
device. e.g. before attaching to a device, we have no idea about the
supported page sizes of the underlying iommu, cache coherency, etc.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.

Actually, we have only two kinds of IOASIDs so far. One is used as parent
and the other as child. For the child, this proposal has defined
IOASID_CREATE_NESTING. But yeah, I think it is doable to indicate the type
in ALLOC. For a child IOASID, though, one more step is required to configure
its parent IOASID, or such info may be included in the ioctl input as well.
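
For example, the ALLOC input could carry something like below (the struct,
field and flag names are only illustrative, not part of the proposal):

	struct ioasid_alloc_data {
		__u32	flags;		/* e.g. IOASID_ALLOC_PARENT or
					 * IOASID_ALLOC_CHILD (illustrative) */
		__u32	parent_ioasid;	/* valid only when allocating a child */
		/* possibly also the expected pgtable format for a child */
	};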
 
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is tring to use.

yeah, I guess you mean to fail the device attach when the IOASID is a
nesting IOASID but the device is behind an iommu without nesting support.
right?

> 
> > /*
> >   * Map/unmap process virtual addresses to I/O virtual addresses.
> >   *
> >   * Provide VFIO type1 equivalent semantics. Start with the same 
> >   * restriction e.g. the unmap size should match those used in the 
> >   * original mapping call. 
> >   *
> >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> >   * must be already in the preregistered list.
> >   *
> >   * Input parameters:
> >   *	- u32 ioasid;
> >   *	- refer to vfio_iommu_type1_dma_{un}map
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)  
> 
> What about nested IOASIDs?

at first glance, it looks like we should prevent MAP/UNMAP usage on nested
IOASIDs. At least hardware nested translation only allows MAP/UNMAP on the
parent IOASID and page table bind on the nested IOASID. But considering
software nesting, it still seems useful to allow MAP/UNMAP usage on nested
IOASIDs. This is how I understand it; what is your opinion? Do you think
it's better to allow MAP/UNMAP usage only on parent IOASIDs as a start?

> 
> > /*
> >   * Create a nesting IOASID (child) on an existing IOASID (parent)
> >   *
> >   * IOASIDs can be nested together, implying that the output address 
> >   * from one I/O page table (child) must be further translated by 
> >   * another I/O page table (parent).
> >   *
> >   * As the child adds essentially another reference to the I/O page table 
> >   * represented by the parent, any device attached to the child ioasid 
> >   * must be already attached to the parent.
> >   *
> >   * In concept there is no limit on the number of the nesting levels. 
> >   * However for the majority case one nesting level is sufficient. The
> >   * user should check whether an IOASID supports nesting through 
> >   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> >   * the nesting capability is reported only on the parent instead of the
> >   * child.
> >   *
> >   * User also needs check (via IOASID_GET_INFO) whether the nesting 
> >   * is implemented in hardware or software. If software-based, DMA 
> >   * mapping protocol should be used on the child IOASID. Otherwise, 
> >   * the child should be operated with pgtable binding protocol.
> >   *
> >   * Input parameters:
> >   *	- u32 parent_ioasid;
> >   *
> >   * Return: child_ioasid on success, -errno on failure;
> >   */
> > #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)  
> 
> Do you think another ioctl is best? Should this just be another
> parameter to alloc?

either is fine. This ioctl follows one of your previous comments.

https://lore.kernel.org/linux-iommu/20210422121020.GT1370958@nvidia.com/

> 
> > /*
> >   * Bind an user-managed I/O page table with the IOMMU
> >   *
> >   * Because user page table is untrusted, IOASID nesting must be enabled 
> >   * for this ioasid so the kernel can enforce its DMA isolation policy 
> >   * through the parent ioasid.
> >   *
> >   * Pgtable binding protocol is different from DMA mapping. The latter 
> >   * has the I/O page table constructed by the kernel and updated 
> >   * according to user MAP/UNMAP commands. With pgtable binding the 
> >   * whole page table is created and updated by userspace, thus different 
> >   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> >   *
> >   * Because the page table is directly walked by the IOMMU, the user 
> >   * must  use a format compatible to the underlying hardware. It can 
> >   * check the format information through IOASID_GET_INFO.
> >   *
> >   * The page table is bound to the IOMMU according to the routing 
> >   * information of each attached device under the specified IOASID. The
> >   * routing information (RID and optional PASID) is registered when a 
> >   * device is attached to this IOASID through VFIO uAPI. 
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> >   *	- address of the user page table;
> >   *	- formats (vendor, address_width, etc.);
> >   * 
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)  
> 
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?

here the model is that user-space gets the page table format from the
kernel and decides if it can proceed. So what you are suggesting is that
user-space should tell the kernel the page table format it has in ALLOC,
and the kernel should fail the ALLOC if the user-space page table format
is not compatible with the underlying iommu?

> 
> > /*
> >   * Bind an user-managed PASID table to the IOMMU
> >   *
> >   * This is required for platforms which place PASID table in the GPA space.
> >   * In this case the specified IOASID represents the per-RID PASID space.
> >   *
> >   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> >   * special flag to indicate the difference from normal I/O address spaces.
> >   *
> >   * The format info of the PASID table is reported in IOASID_GET_INFO.
> >   *
> >   * As explained in the design section, user-managed I/O page tables must
> >   * be explicitly bound to the kernel even on these platforms. It allows
> >   * the kernel to uniformly manage I/O address spaces cross all platforms.
> >   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> >   * to carry device routing information to indirectly mark the hidden I/O
> >   * address spaces.
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> >   *	- address of PASID table;
> >   *	- formats (vendor, size, etc.);
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> > #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)  
> 
> Ditto
> 
> > 
> > /*
> >   * Invalidate IOTLB for an user-managed I/O page table
> >   *
> >   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
> >   * doesn't allow the user to specify cache type and likely support only
> >   * two granularities (all, or a specified range) in the I/O address space.
> >   *
> >   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> >   * cache). If the IOASID represents an I/O address space, the invalidation
> >   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> >   * represents a vPASID space, then this command applies to the PASID
> >   * cache.
> >   *
> >   * Similarly this command doesn't provide IOMMU-like granularity
> >   * info (domain-wide, pasid-wide, range-based), since it's all about the
> >   * I/O address space itself. The ioasid driver walks the attached
> >   * routing information to match the IOMMU semantics under the
> >   * hood. 
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> >   *	- granularity
> >   * 
> >   * Return: 0 on success, -errno on failure
> >   */
> > #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)  
> 
> This should have an IOVA range too?
> 
> > /*
> >   * Page fault report and response
> >   *
> >   * This is TBD. Can be added after other parts are cleared up. Likely it 
> >   * will be a ring buffer shared between user/kernel, an eventfd to notify 
> >   * the user and an ioctl to complete the fault.
> >   *
> >   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> >   */  
> 
> Any reason not to just use read()?

a ring buffer may be mmap'ed to user-space, thus reading fault data from
the kernel would be faster. This is also how Eric's fault reporting works
today.

https://lore.kernel.org/linux-iommu/20210411114659.15051-5-eric.auger@redhat.com/
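
Just to illustrate the direction (the layout below is only a sketch; the
actual format is TBD as stated in the proposal):

	struct ioasid_fault_entry {
		__u32	ioasid;
		__u32	flags;
		__u64	fault_addr;
		/* permission bits, device/PASID info, etc. */
	};

	/* The kernel produces entries into the mmap'ed ring, signals the
	 * user through an eventfd, and the user completes each fault with
	 * a response ioctl.
	 */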

> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++  
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
> 
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *  - ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)  
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

perhaps this is the device info Jean-Philippe wants in the page fault
reporting path?
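
For example, VFIO_BIND_IOASID_FD could return such an id (sketch only,
not defined in the proposal yet):

	bind_data = { .ioasid_fd = ioasid_fd };
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &bind_data);
	/* bind_data.out_dev_id would then identify this device in fault
	 * records delivered through the ioasid_fd.
	 */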

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-05-31 17:37   ` Parav Pandit
  -1 siblings, 0 replies; 518+ messages in thread
From: Parav Pandit @ 2021-05-31 17:37 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy



> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, May 27, 2021 1:28 PM
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-
> iommu/20210330132830.GO2356281@nvidia.com/
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the detailed RFC. Digesting it...

[..]
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */	
It appears that this is only to make the map ioctl faster, apart from the
accounting. It doesn't have any ioasid handle input either.

In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
For example, a few years back such a system call, mpin(), was proposed in [1].

Or is a new MAP_PINNED flag a better approach, achieving this in a single mmap() call?

> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE,
> IOASID_BASE + 2)

[1] https://lwn.net/Articles/600502/

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 11:31     ` Liu Yi L
@ 2021-05-31 18:09       ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-31 18:09 UTC (permalink / raw)
  To: Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave, Wu, Hao,
	David Woodhouse, Jason Wang

On Mon, May 31, 2021 at 07:31:57PM +0800, Liu Yi L wrote:
> > > /*
> > >   * Get information about an I/O address space
> > >   *
> > >   * Supported capabilities:
> > >   *	- VFIO type1 map/unmap;
> > >   *	- pgtable/pasid_table binding
> > >   *	- hardware nesting vs. software nesting;
> > >   *	- ...
> > >   *
> > >   * Related attributes:
> > >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> > >   *	- vendor pgtable formats (pgtable binding);
> > >   *	- number of child IOASIDs (nesting);
> > >   *	- ...
> > >   *
> > >   * Above information is available only after one or more devices are
> > >   * attached to the specified IOASID. Otherwise the IOASID is just a
> > >   * number w/o any capability or attribute.  
> > 
> > This feels wrong to learn most of these attributes of the IOASID after
> > attaching to a device.
> 
> but an IOASID is just a software handle before attached to a specific
> device. e.g. before attaching to a device, we have no idea about the
> supported page size in underlying iommu, coherent etc.

The idea is that you attach the device to the /dev/ioasid FD, and this
action is what crystallizes the iommu driver that is being used:

        device_fd = open("/dev/vfio/devices/dev1", mode);
        ioasid_fd = open("/dev/ioasid", mode);
        ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

After this sequence we should have most of the information about the
IOMMU.

One /dev/ioasid FD has one iommu driver. Design what an "iommu driver"
means so that the system should only have one. Eg the coherent/not
coherent distinction should not be a different "iommu driver".

Device attach to the _IOASID_ is a different thing, and I think it
puts the whole sequence out of order because we lose the option to
customize the IOASID before it has to be realized into HW format.

> > The user should have some idea how it intends to use the IOASID when
> > it creates it and the rest of the system should match the intention.
> > 
> > For instance if the user is creating a IOASID to cover the guest GPA
> > with the intention of making children it should indicate this during
> > alloc.
> > 
> > If the user is intending to point a child IOASID to a guest page table
> > in a certain descriptor format then it should indicate it during
> > alloc.
> 
> Actually, we have only two kinds of IOASIDs so far. 

Maybe at a very, very high level, but it looks like there is a lot of
IOMMU-specific configuration that goes into an IOASID.


> > device bind should fail if the device somehow isn't compatible with
> > the scheme the user is tring to use.
> 
> yeah, I guess you mean to fail the device attach when the IOASID is a
> nesting IOASID but the device is behind an iommu without nesting support.
> right?

Right..
 
> > 
> > > /*
> > >   * Map/unmap process virtual addresses to I/O virtual addresses.
> > >   *
> > >   * Provide VFIO type1 equivalent semantics. Start with the same 
> > >   * restriction e.g. the unmap size should match those used in the 
> > >   * original mapping call. 
> > >   *
> > >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >   * must be already in the preregistered list.
> > >   *
> > >   * Input parameters:
> > >   *	- u32 ioasid;
> > >   *	- refer to vfio_iommu_type1_dma_{un}map
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)  
> > 
> > What about nested IOASIDs?
> 
> at first glance, it looks like we should prevent the MAP/UNMAP usage on
> nested IOASIDs. At least hardware nested translation only allows MAP/UNMAP
> on the parent IOASIDs and page table bind on nested IOASIDs. But considering
> about software nesting, it seems still useful to allow MAP/UNMAP usage
> on nested IOASIDs. This is how I understand it, how about your opinion
> on it? do you think it's better to allow MAP/UNMAP usage only on parent
> IOASIDs as a start?

If the only form of nested IOASID is the "read the page table from
my process memory" then MAP/UNMAP won't make sense on that..

MAP/UNMAP is only useful if the page table is stored in kernel memory.

> > > #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)  
> > 
> > Do you think another ioctl is best? Should this just be another
> > parameter to alloc?
> 
> either is fine. This ioctl is following one of your previous comment.

Sometimes I say things in a way that is meant to make concepts easier to
understand, not necessarily good API design :)

> > > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)  
> > 
> > Also feels backwards, why wouldn't we specify this, and the required
> > page table format, during alloc time?
> 
> here the model is user-space gets the page table format from kernel and
> decide if it can proceed. So what you are suggesting is user-space should
> tell kernel the page table format it has in ALLOC and kenrel should fail
> the ALLOC if the user-space page table format is not compatible with underlying
> iommu?

Yes, the action should be
   Alloc an IOASID that points at a page table in this user memory,
   that is stored in this specific format.

The supported formats should be discoverable after VFIO_BIND_IOASID_FD
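
Roughly, something like this (a sketch only; the field names and how the
formats are reported are illustrative, not an actual definition):

        ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
        /* query supported pgtable formats via some FD-level GET_INFO */

        alloc = {
                .type          = IOASID_TYPE_USER_PGTABLE,
                .parent        = gpa_ioasid,
                .pgtable_addr  = gva_pgtable1,
                .format        = /* one of the formats reported above */,
        };
        gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);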

> > > /*
> > >   * Page fault report and response
> > >   *
> > >   * This is TBD. Can be added after other parts are cleared up. Likely it 
> > >   * will be a ring buffer shared between user/kernel, an eventfd to notify 
> > >   * the user and an ioctl to complete the fault.
> > >   *
> > >   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> > >   */  
> > 
> > Any reason not to just use read()?
> 
> a ring buffer may be mmap to user-space, thus reading fault data from kernel
> would be faster. This is also how Eric's fault reporting is doing today.

Okay, if it is performance sensitive.. mmap rings are just tricky beasts

> > >    * Bind a vfio_device to the specified IOASID fd
> > >    *
> > >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > >    * vfio device should not be bound to multiple ioasid_fd's.
> > >    *
> > >    * Input parameters:
> > >    *  - ioasid_fd;
> > >    *
> > >    * Return: 0 on success, -errno on failure.
> > >    */
> > > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)  
> > 
> > This is where it would make sense to have an output "device id" that
> > allows /dev/ioasid to refer to this "device" by number in events and
> > other related things.
> 
> perhaps this is the device info Jean Philippe wants in page fault reporting
> path?

Yes, it is

Jason
 

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 17:37   ` Parav Pandit
@ 2021-05-31 18:12     ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-05-31 18:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:

> In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
> For example few years back such system call mpin() thought was proposed in [1].

Reference counting of the overall pins is required

So when a pinned page is incorporated into an IOASID page table in a
later IOCTL, it means it cannot be unpinned while the IOASID page table
is using it.

This is a trick to organize the pinning into groups and then refcount
each group, thus avoiding per-page refcounts.

The data structure would be an interval tree of pins in general

The ioasid itself would have an interval tree of its own mappings;
each entry in this tree would take a reference count against an element
in the above tree

Then the ioasid's interval tree would be mapped into a page table tree
in HW format.

The redundant storage is needed to keep track of the referencing and
the CPU page table values for later unpinning.
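
Roughly, as a sketch of the idea (not actual kernel code):

    /* one node per pre-registered/pinned VA range */
    struct pin_range {
        struct interval_tree_node node;   /* [vaddr, vaddr + size) */
        struct page **pages;              /* the pinned pages */
        refcount_t users;                 /* mappings referencing this range */
    };

    /* one node per IOASID mapping; each takes a ref on a pin_range */
    struct ioasid_mapping {
        struct interval_tree_node node;   /* [iova, iova + size) */
        struct pin_range *pin;
    };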

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 11:31     ` Liu Yi L
@ 2021-06-01  1:25       ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01  1:25 UTC (permalink / raw)
  To: Liu Yi L, Jason Gunthorpe
  Cc: baolu.lu, Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj,
	Ashok, kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy, David Gibson

On 5/31/21 7:31 PM, Liu Yi L wrote:
> On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:
> 
>> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>>
>>> 2.1. /dev/ioasid uAPI
>>> +++++++++++++++++

[---cut for short---]

>>> /*
>>>    * Allocate an IOASID.
>>>    *
>>>    * IOASID is the FD-local software handle representing an I/O address
>>>    * space. Each IOASID is associated with a single I/O page table. User
>>>    * must call this ioctl to get an IOASID for every I/O address space that is
>>>    * intended to be enabled in the IOMMU.
>>>    *
>>>    * A newly-created IOASID doesn't accept any command before it is
>>>    * attached to a device. Once attached, an empty I/O page table is
>>>    * bound with the IOMMU then the user could use either DMA mapping
>>>    * or pgtable binding commands to manage this I/O page table.
>> Can the IOASID be populated before being attached?
> perhaps a MAP/UNMAP operation on a gpa_ioasid?
> 

But before attaching to any device, there's no connection between an
IOASID and the underlying IOMMU. How do you know the supported page
sizes and cache coherency?

The iommu_group restriction is implicitly expressed as: only after all
devices belonging to an iommu_group are attached can operations on the
page table be performed.
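
As an illustration of the ordering this implies (reusing the example names
from the proposal loosely; treat the exact ioctl names as assumptions):

	ioasid_fd = open("/dev/ioasid", O_RDWR);
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);

	/* attach every device of the iommu_group before using the table */
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* only now are page sizes and coherency known, so GET_INFO and
	 * MAP/UNMAP on gpa_ioasid become meaningful */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map_data);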

Best regards,
baolu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31  8:41   ` Liu Yi L
@ 2021-06-01  2:36       ` Jason Wang
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  2:36 UTC (permalink / raw)
  To: Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet


On 2021/5/31 4:41 PM, Liu Yi L wrote:
>> I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't
>> support hardware nesting. Or is there a way to detect the capability
>> beforehand?
> I think it could fail in IOASID_CREATE_NESTING. If the gpa_ioasid
> is not able to support nesting, then the call should fail.
>
>> I think GET_INFO only works after the ATTACH.
> Yes. After attaching to gpa_ioasid, userspace could call GET_INFO on
> the gpa_ioasid and check whether nesting is supported, right?


Some more questions:

1) Is the handle returned by IOASID_ALLOC an fd?
2) If yes, what's the reason for not simply using the fd opened from
/dev/ioasid? (This is the question that has not been answered.) And what
happens if we call GET_INFO on the ioasid_fd?
3) If not, how does GET_INFO work?


>
>>> 	/* Bind guest I/O page table  */
>>> 	bind_data = {
>>> 		.ioasid	= giova_ioasid;
>>> 		.addr	= giova_pgtable;
>>> 		// and format information
>>> 	};
>>> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>>>
>>> 	/* Invalidate IOTLB when required */
>>> 	inv_data = {
>>> 		.ioasid	= giova_ioasid;
>>> 		// granular information
>>> 	};
>>> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>>>
>>> 	/* See 5.6 for I/O page fault handling */
>>> 	
>>> 5.5. Guest SVA (vSVA)
>>> ++++++++++++++++++
>>>
>>> After boot, the guest further creates a GVA address space (gpasid1) on
>>> dev1. Dev2 is not affected (still attached to giova_ioasid).
>>>
>>> As explained in section 4, the user should avoid exposing ENQCMD on both
>>> pdev and mdev.
>>>
>>> The sequence applies to all device types (being pdev or mdev), except
>>> one additional step to call KVM for ENQCMD-capable mdev:
>> My understanding is ENQCMD is Intel specific and not a requirement for
>> having vSVA.
> ENQCMD is not really Intel specific, although only Intel supports it today.
> The PCIe DMWr capability is what lets software enumerate the ENQCMD
> support on the device side. Yes, it is not a requirement for vSVA; they
> are orthogonal.


Right, then it's better to mention DMWr instead of a vendor specific 
instruction in a general framework like ioasid.

Thanks


>


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 18:09       ` Jason Gunthorpe
@ 2021-06-01  3:08         ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01  3:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu Yi L
  Cc: baolu.lu, Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj,
	Ashok, kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy, David Gibson

On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
>>> device bind should fail if the device somehow isn't compatible with
>>> the scheme the user is trying to use.
>> yeah, I guess you mean to fail the device attach when the IOASID is a
>> nesting IOASID but the device is behind an iommu without nesting support.
>> right?
> Right..
>   

Just want to confirm...

Does this mean that we only support hardware nesting and don't want to
have software nesting (a shadow page table in the kernel) in IOASID?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  2:36       ` Jason Wang
  (?)
@ 2021-06-01  3:31       ` Liu Yi L
  2021-06-01  5:08           ` Jason Wang
  -1 siblings, 1 reply; 518+ messages in thread
From: Liu Yi L @ 2021-06-01  3:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger <eric.auger@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	kvm, LKML, iommu, Jason Gunthorpe, David Woodhouse

On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote:

> 在 2021/5/31 下午4:41, Liu Yi L 写道:
> >> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
> >> hardware nesting. Or is there way to detect the capability before?  
> > I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
> > is not able to support nesting, then should fail it.
> >  
> >> I think GET_INFO only works after the ATTACH.  
> > yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
> > gpa_ioasid and check if nesting is supported or not. right?  
> 
> 
> Some more questions:
> 
> 1) Is the handle returned by IOASID_ALLOC an fd?

it's an ID so far in this proposal.

> 2) If yes, what's the reason for not simply use the fd opened from 
> /dev/ioas. (This is the question that is not answered) and what happens 
> if we call GET_INFO for the ioasid_fd?
> 3) If not, how GET_INFO work?

Oh, I missed this question in my prior reply. Personally, there is no
special reason yet. But using an ID may give us the opportunity to
customize the management of the handle. For one, better lookup efficiency
by using an xarray to store the allocated IDs. For two, we could
categorize the allocated IDs (parent or nested). GET_INFO just works with
an input FD and an ID.
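
For illustration only (the struct layout and the exact GET_INFO ioctl name
are assumptions here), a lookup would then take both the fd and the ID:

	info = {
		.ioasid = gpa_ioasid,	/* ID returned by IOASID_ALLOC */
	};
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);
	/* the fd selects the container, the ID selects one address space */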

> 
> >  
> >>> 	/* Bind guest I/O page table  */
> >>> 	bind_data = {
> >>> 		.ioasid	= giova_ioasid;
> >>> 		.addr	= giova_pgtable;
> >>> 		// and format information
> >>> 	};
> >>> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> >>>
> >>> 	/* Invalidate IOTLB when required */
> >>> 	inv_data = {
> >>> 		.ioasid	= giova_ioasid;
> >>> 		// granular information
> >>> 	};
> >>> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
> >>>
> >>> 	/* See 5.6 for I/O page fault handling */
> >>> 	
> >>> 5.5. Guest SVA (vSVA)
> >>> ++++++++++++++++++
> >>>
> >>> After boots the guest further create a GVA address spaces (gpasid1) on
> >>> dev1. Dev2 is not affected (still attached to giova_ioasid).
> >>>
> >>> As explained in section 4, user should avoid expose ENQCMD on both
> >>> pdev and mdev.
> >>>
> >>> The sequence applies to all device types (being pdev or mdev), except
> >>> one additional step to call KVM for ENQCMD-capable mdev:  
> >> My understanding is ENQCMD is Intel specific and not a requirement for
> >> having vSVA.  
> > ENQCMD is not really Intel specific although only Intel supports it today.
> > The PCIe DMWr capability is the capability for software to enumerate the
> > ENQCMD support in device side. yes, it is not a requirement for vSVA. They
> > are orthogonal.  
> 
> 
> Right, then it's better to mention DMWr instead of a vendor specific 
> instruction in a general framework like ioasid.

good suggestion. :)

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  2:36       ` Jason Wang
@ 2021-06-01  4:27         ` Shenming Lu
  -1 siblings, 0 replies; 518+ messages in thread
From: Shenming Lu @ 2021-06-01  4:27 UTC (permalink / raw)
  To: Jason Wang, Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Zenghui Yu, wanghaibin.wang

On 2021/6/1 10:36, Jason Wang wrote:
> 
> 在 2021/5/31 下午4:41, Liu Yi L 写道:
>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>> hardware nesting. Or is there way to detect the capability before?
>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>> is not able to support nesting, then should fail it.
>>
>>> I think GET_INFO only works after the ATTACH.
>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>> gpa_ioasid and check if nesting is supported or not. right?
> 
> 
> Some more questions:
> 
> 1) Is the handle returned by IOASID_ALLOC an fd?
> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd?
> 3) If not, how GET_INFO work?

It seems that the return value from IOASID_ALLOC is an IOASID number in the
ioasid_data struct. Then, when calling GET_INFO, we should convey this IOASID
number to get the associated I/O address space attributes (which depend on the
physical IOMMU and could be discovered when attaching a device to the
IOASID fd or number), right?

Thanks,
Shenming

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-06-01  4:31   ` Shenming Lu
  -1 siblings, 0 replies; 518+ messages in thread
From: Shenming Lu @ 2021-06-01  4:31 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/5/27 15:58, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
> 

[..]

> 
> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */

Hi,

It seems that the ioasid has different usages in different situations: it could
be directly used in the physical routing, or be just a virtual handle that
indicates a page table or a vPASID table (such as the GPA address space; in the
simple passthrough case, the DMA input to the IOMMU will just contain a Stream
ID, no Substream ID), right?

And Baolu suggested that since one device might consume multiple page tables,
it's more reasonable to have one fault handler per page table. By this, do we
have to maintain such an ioasid info list in the IOMMU layer?

Then if we add host IOPF support (for the GPA address space) in the future
(I have sent a series for this, but it was aimed at VFIO; I will convert it for
IOASID later [1] :-)), how could we find the handler for a received fault
event which only contains a Stream ID? Do we also have to maintain a
dev(vPASID)->ioasid mapping in the IOMMU layer?

[1] https://lore.kernel.org/patchwork/cover/1410223/
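
The fault path is still TBD in the proposal, but purely as a sketch of the
user-side loop one might expect (all names below are hypothetical):

	/* wait until /dev/ioasid signals the eventfd */
	read(fault_eventfd, &cnt, sizeof(cnt));

	/* drain fault records from the shared ring */
	while (ring_pop(fault_ring, &rec)) {
		/* rec must carry enough routing info: device label,
		 * IOASID and the faulting address */
		handle_fault(rec.dev_id, rec.ioasid, rec.fault_addr);
		ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);
	}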

Thanks,
Shenming

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  3:31       ` Liu Yi L
@ 2021-06-01  5:08           ` Jason Wang
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  5:08 UTC (permalink / raw)
  To: Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet


On 2021/6/1 11:31 AM, Liu Yi L wrote:
> On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote:
>
>> 在 2021/5/31 下午4:41, Liu Yi L 写道:
>>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>>> hardware nesting. Or is there way to detect the capability before?
>>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>>> is not able to support nesting, then should fail it.
>>>   
>>>> I think GET_INFO only works after the ATTACH.
>>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>>> gpa_ioasid and check if nesting is supported or not. right?
>>
>> Some more questions:
>>
>> 1) Is the handle returned by IOASID_ALLOC an fd?
> it's an ID so far in this proposal.


Ok.


>
>> 2) If yes, what's the reason for not simply use the fd opened from
>> /dev/ioas. (This is the question that is not answered) and what happens
>> if we call GET_INFO for the ioasid_fd?
>> 3) If not, how GET_INFO work?
> oh, missed this question in prior reply. Personally, no special reason
> yet. But using ID may give us opportunity to customize the management
> of the handle. For one, better lookup efficiency by using xarray to
> store the allocated IDs. For two, could categorize the allocated IDs
> (parent or nested). GET_INFO just works with an input FD and an ID.


I'm not sure I get this; for nesting cases you can still make the child
an fd.

And still a question: in what case do we need to create multiple ioasids
on a single ioasid fd?

(This case is not demonstrated in your examples.)

Thanks


>
>>>   
>>>>> 	/* Bind guest I/O page table  */
>>>>> 	bind_data = {
>>>>> 		.ioasid	= giova_ioasid;
>>>>> 		.addr	= giova_pgtable;
>>>>> 		// and format information
>>>>> 	};
>>>>> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>>>>>
>>>>> 	/* Invalidate IOTLB when required */
>>>>> 	inv_data = {
>>>>> 		.ioasid	= giova_ioasid;
>>>>> 		// granular information
>>>>> 	};
>>>>> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>>>>>
>>>>> 	/* See 5.6 for I/O page fault handling */
>>>>> 	
>>>>> 5.5. Guest SVA (vSVA)
>>>>> ++++++++++++++++++
>>>>>
>>>>> After boots the guest further create a GVA address spaces (gpasid1) on
>>>>> dev1. Dev2 is not affected (still attached to giova_ioasid).
>>>>>
>>>>> As explained in section 4, user should avoid expose ENQCMD on both
>>>>> pdev and mdev.
>>>>>
>>>>> The sequence applies to all device types (being pdev or mdev), except
>>>>> one additional step to call KVM for ENQCMD-capable mdev:
>>>> My understanding is ENQCMD is Intel specific and not a requirement for
>>>> having vSVA.
>>> ENQCMD is not really Intel specific although only Intel supports it today.
>>> The PCIe DMWr capability is the capability for software to enumerate the
>>> ENQCMD support in device side. yes, it is not a requirement for vSVA. They
>>> are orthogonal.
>>
>> Right, then it's better to mention DMWr instead of a vendor specific
>> instruction in a general framework like ioasid.
> good suggestion. :)
>


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  4:27         ` Shenming Lu
@ 2021-06-01  5:10           ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  5:10 UTC (permalink / raw)
  To: Shenming Lu, Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Zenghui Yu, wanghaibin.wang


On 2021/6/1 12:27 PM, Shenming Lu wrote:
> On 2021/6/1 10:36, Jason Wang wrote:
>> On 2021/5/31 4:41 PM, Liu Yi L wrote:
>>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>>> hardware nesting. Or is there way to detect the capability before?
>>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>>> is not able to support nesting, then should fail it.
>>>
>>>> I think GET_INFO only works after the ATTACH.
>>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>>> gpa_ioasid and check if nesting is supported or not. right?
>>
>> Some more questions:
>>
>> 1) Is the handle returned by IOASID_ALLOC an fd?
>> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd?
>> 3) If not, how GET_INFO work?
> It seems that the return value from IOASID_ALLOC is an IOASID number in the
> ioasid_data struct, then when calling GET_INFO, we should convey this IOASID
> number to get the associated I/O address space attributes (depend on the
> physical IOMMU, which could be discovered when attaching a device to the
> IOASID fd or number), right?


Right, but the question is why we need such indirection, unless there's a
case where you need to create multiple IOASIDs per ioasid fd. It's
simpler to attach the metadata to the ioasid fd itself.

Thanks


>
> Thanks,
> Shenming
>


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  4:31   ` Shenming Lu
@ 2021-06-01  5:10     ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01  5:10 UTC (permalink / raw)
  To: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: baolu.lu, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu,
	Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

Hi Shenming,

On 6/1/21 12:31 PM, Shenming Lu wrote:
> On 2021/5/27 15:58, Tian, Kevin wrote:
>> /dev/ioasid provides an unified interface for managing I/O page tables for
>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>> etc.) are expected to use this interface instead of creating their own logic to
>> isolate untrusted device DMAs initiated by userspace.
>>
>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>> with VFIO as example in typical usages. The driver-facing kernel API provided
>> by the iommu layer is still TBD, which can be discussed after consensus is
>> made on this uAPI.
>>
>> It's based on a lengthy discussion starting from here:
>> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>
>> It ends up to be a long writing due to many things to be summarized and
>> non-trivial effort required to connect them into a complete proposal.
>> Hope it provides a clean base to converge.
>>
> 
> [..]
> 
>>
>> /*
>>    * Page fault report and response
>>    *
>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>    * the user and an ioctl to complete the fault.
>>    *
>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>    */
> 
> Hi,
> 
> It seems that the ioasid has different usage in different situation, it could
> be directly used in the physical routing, or just a virtual handle that indicates
> a page table or a vPASID table (such as the GPA address space, in the simple
> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
> Substream ID), right?
> 
> And Baolu suggested that since one device might consume multiple page tables,
> it's more reasonable to have one fault handler per page table. By this, do we
> have to maintain such an ioasid info list in the IOMMU layer?

As discussed earlier, the I/O page fault and cache invalidation paths
will have "device labels" so that the information can be easily
translated and routed.

So it's likely the per-device fault handler registration API in the iommu
core can be kept, but /dev/ioasid will grow a layer that translates and
propagates I/O page fault information to the right consumers.

If things evolve in this way, the SVA I/O page fault handling probably
also needs to be ported to /dev/ioasid.
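
Purely as an illustration of such routing information (the actual record
layout is TBD and these field names are assumptions):

	struct ioasid_fault_record {
		__u32	dev_label;	/* which bound device faulted */
		__u32	ioasid;		/* I/O address space resolved by the kernel */
		__u64	fault_addr;	/* faulting IOVA/GVA */
		__u32	flags;		/* read/write/exec, response required, ... */
	};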

> 
> Then if we add host IOPF support (for the GPA address space) in the future
> (I have sent a series for this but it aimed for VFIO, I will convert it for
> IOASID later [1] :-)), how could we find the handler for the received fault
> event which only contains a Stream ID... Do we also have to maintain a
> dev(vPASID)->ioasid mapping in the IOMMU layer?
> 
> [1] https://lore.kernel.org/patchwork/cover/1410223/

Best regards,
baolu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:08           ` Jason Wang
@ 2021-06-01  5:23             ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01  5:23 UTC (permalink / raw)
  To: Jason Wang, Liu Yi L
  Cc: baolu.lu, yi.l.liu, Tian, Kevin, LKML, Joerg Roedel,
	Jason Gunthorpe, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:
>>> 2) If yes, what's the reason for not simply use the fd opened from
>>> /dev/ioas. (This is the question that is not answered) and what happens
>>> if we call GET_INFO for the ioasid_fd?
>>> 3) If not, how GET_INFO work?
>> oh, missed this question in prior reply. Personally, no special reason
>> yet. But using ID may give us opportunity to customize the management
>> of the handle. For one, better lookup efficiency by using xarray to
>> store the allocated IDs. For two, could categorize the allocated IDs
>> (parent or nested). GET_INFO just works with an input FD and an ID.
> 
> 
> I'm not sure I get this, for nesting cases you can still make the child 
> an fd.
> 
> And a question still, under what case we need to create multiple ioasids 
> on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASIDs per FD" vs. "multiple
IOASID FDs" for such a case.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:23             ` Lu Baolu
@ 2021-06-01  5:29               ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  5:29 UTC (permalink / raw)
  To: Lu Baolu, Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet


On 2021/6/1 1:23 PM, Lu Baolu wrote:
> Hi Jason W,
>
> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>> /dev/ioas. (This is the question that is not answered) and what 
>>>> happens
>>>> if we call GET_INFO for the ioasid_fd?
>>>> 3) If not, how GET_INFO work?
>>> oh, missed this question in prior reply. Personally, no special reason
>>> yet. But using ID may give us opportunity to customize the management
>>> of the handle. For one, better lookup efficiency by using xarray to
>>> store the allocated IDs. For two, could categorize the allocated IDs
>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>
>>
>> I'm not sure I get this, for nesting cases you can still make the 
>> child an fd.
>>
>> And a question still, under what case we need to create multiple 
>> ioasids on a single ioasid fd?
>
> One possible situation where multiple IOASIDs per FD could be used is
> that devices with different underlying IOMMU capabilities are sharing a
> single FD. In this case, only devices with consistent underlying IOMMU
> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> be applied.
>
> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> IOASID FDs" for such case.


Right, that's exactly my question. The latter seems much easier to
understand and implement.

Thanks


>
> Best regards,
> baolu
>


^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:29               ` Jason Wang
@ 2021-06-01  5:42                 ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  5:42 UTC (permalink / raw)
  To: Jason Wang, Lu Baolu, Liu Yi L
  Cc: Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, Jason Gunthorpe,
	David Woodhouse

> From: Jason Wang
> Sent: Tuesday, June 1, 2021 1:30 PM
> 
> On 2021/6/1 1:23 PM, Lu Baolu wrote:
> > Hi Jason W,
> >
> > On 6/1/21 1:08 PM, Jason Wang wrote:
> >>>> 2) If yes, what's the reason for not simply use the fd opened from
> >>>> /dev/ioas. (This is the question that is not answered) and what
> >>>> happens
> >>>> if we call GET_INFO for the ioasid_fd?
> >>>> 3) If not, how GET_INFO work?
> >>> oh, missed this question in prior reply. Personally, no special reason
> >>> yet. But using ID may give us opportunity to customize the management
> >>> of the handle. For one, better lookup efficiency by using xarray to
> >>> store the allocated IDs. For two, could categorize the allocated IDs
> >>> (parent or nested). GET_INFO just works with an input FD and an ID.
> >>
> >>
> >> I'm not sure I get this, for nesting cases you can still make the
> >> child an fd.
> >>
> >> And a question still, under what case we need to create multiple
> >> ioasids on a single ioasid fd?
> >
> > One possible situation where multiple IOASIDs per FD could be used is
> > that devices with different underlying IOMMU capabilities are sharing a
> > single FD. In this case, only devices with consistent underlying IOMMU
> > capabilities could be put in an IOASID and multiple IOASIDs per FD could
> > be applied.
> >
> > Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> > IOASID FDs" for such case.
> 
> 
> Right, that's exactly my question. The latter seems much more easier to
> be understood and implemented.
> 

A simple reason discussed in the previous thread: there could be on the
order of 1M I/O address spaces per device, while FDs are a precious
resource. So this RFC treats the fd as a container of address spaces,
each tagged by an IOASID.
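
For example (illustrative only), a single container fd could carry one
IOASID per guest address space:

	ioasid_fd    = open("/dev/ioasid", O_RDWR);
	gpa_ioasid   = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);	/* GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);	/* guest IOVA space */
	gva_ioasid   = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);	/* one per guest PASID, ... */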

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:42                 ` Tian, Kevin
@ 2021-06-01  6:07                   ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  6:07 UTC (permalink / raw)
  To: Tian, Kevin, Lu Baolu, Liu Yi L
  Cc: Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, Jason Gunthorpe,
	David Woodhouse


On 2021/6/1 1:42 PM, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 1:30 PM
>>
>> On 2021/6/1 1:23 PM, Lu Baolu wrote:
>>> Hi Jason W,
>>>
>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>> happens
>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>> 3) If not, how GET_INFO work?
>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>> yet. But using ID may give us opportunity to customize the management
>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>
>>>> I'm not sure I get this, for nesting cases you can still make the
>>>> child an fd.
>>>>
>>>> And a question still, under what case we need to create multiple
>>>> ioasids on a single ioasid fd?
>>> One possible situation where multiple IOASIDs per FD could be used is
>>> that devices with different underlying IOMMU capabilities are sharing a
>>> single FD. In this case, only devices with consistent underlying IOMMU
>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>> be applied.
>>>
>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>> IOASID FDs" for such case.
>>
>> Right, that's exactly my question. The latter seems much more easier to
>> be understood and implemented.
>>
> A simple reason discussed in previous thread - there could be 1M's
> I/O address spaces per device while #FD's are precious resource.


Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with fd semantics, you can do a lot of other stuff: close on 
exec, passing via SCM_RIGHTS.

For the case of 1M, I would like to know what the use case is for a 
single process to handle 1M+ address spaces?
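
(For reference, a minimal userspace sketch of what SCM_RIGHTS fd
passing looks like - plain POSIX sockets, nothing ioasid-specific:)

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* send one fd over an already-connected AF_UNIX socket */
	static int send_fd(int sock, int fd)
	{
		char dummy = 0;
		struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
		union {
			struct cmsghdr align;
			char buf[CMSG_SPACE(sizeof(int))];
		} u;
		struct msghdr msg = {
			.msg_iov = &iov,
			.msg_iovlen = 1,
			.msg_control = u.buf,
			.msg_controllen = sizeof(u.buf),
		};
		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

		return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
	}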


> So this RFC treats fd as a container of address spaces which is each
> tagged by an IOASID.


If the container and the address space are 1:1 then the container
seems useless.

Thanks


>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  6:07                   ` Jason Wang
@ 2021-06-01  6:16                     ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  6:16 UTC (permalink / raw)
  To: Jason Wang, Lu Baolu, Liu Yi L
  Cc: kvm, Jonathan Corbet, iommu, LKML,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Gunthorpe, David Woodhouse

> From: Jason Wang
> Sent: Tuesday, June 1, 2021 2:07 PM
> 
> 在 2021/6/1 下午1:42, Tian, Kevin 写道:
> >> From: Jason Wang
> >> Sent: Tuesday, June 1, 2021 1:30 PM
> >>
> >> 在 2021/6/1 下午1:23, Lu Baolu 写道:
> >>> Hi Jason W,
> >>>
> >>> On 6/1/21 1:08 PM, Jason Wang wrote:
> >>>>>> 2) If yes, what's the reason for not simply use the fd opened from
> >>>>>> /dev/ioas. (This is the question that is not answered) and what
> >>>>>> happens
> >>>>>> if we call GET_INFO for the ioasid_fd?
> >>>>>> 3) If not, how GET_INFO work?
> >>>>> oh, missed this question in prior reply. Personally, no special reason
> >>>>> yet. But using ID may give us opportunity to customize the
> management
> >>>>> of the handle. For one, better lookup efficiency by using xarray to
> >>>>> store the allocated IDs. For two, could categorize the allocated IDs
> >>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
> >>>>
> >>>> I'm not sure I get this, for nesting cases you can still make the
> >>>> child an fd.
> >>>>
> >>>> And a question still, under what case we need to create multiple
> >>>> ioasids on a single ioasid fd?
> >>> One possible situation where multiple IOASIDs per FD could be used is
> >>> that devices with different underlying IOMMU capabilities are sharing a
> >>> single FD. In this case, only devices with consistent underlying IOMMU
> >>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> >>> be applied.
> >>>
> >>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> >>> IOASID FDs" for such case.
> >>
> >> Right, that's exactly my question. The latter seems much more easier to
> >> be understood and implemented.
> >>
> > A simple reason discussed in previous thread - there could be 1M's
> > I/O address spaces per device while #FD's are precious resource.
> 
> 
> Is the concern for ulimit or performance? Note that we had
> 
> #define NR_OPEN_MAX ~0U
> 
> And with the fd semantic, you can do a lot of other stuffs: close on
> exec, passing via SCM_RIGHTS.

yes, fd has its merits.

> 
> For the case of 1M, I would like to know what's the use case for a
> single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest 
there could be many guest processes. Though in reality I haven't seen
anything like 1M processes on a single device, better not to restrict
it in the uAPI?

> 
> 
> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
> 
> 
> If the container and address space is 1:1 then the container seems useless.
> 

Yes, if it's 1:1 then the container is useless. But here it's assumed
to be 1:M, so even a single fd is sufficient for all intended usages.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 20:03   ` Jason Gunthorpe
@ 2021-06-01  7:01     ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  7:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 4:03 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > /dev/ioasid provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.
> 
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
> 
> When you get down to the operations they all seem pretty common sense
> and straightfoward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
> 
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.
> 

Agree. btw I feel it might be good to have several design open issues 
discussed centrally after going through all the comments. Otherwise 
they may be buried in different sub-threads and potentially receive
insufficient attention (especially from people who haven't completed
the reading).

I summarized five open issues here:

1)  Finalizing the name to replace /dev/ioasid;
2)  Whether one device is allowed to bind to multiple IOASID fd's;
3)  Carry device information in invalidation/fault reporting uAPI;
4)  What should/could be specified when allocating an IOASID;
5)  The protocol between vfio group and kvm;

For 1), two alternative names are mentioned: /dev/iommu and 
/dev/ioas. I don't have a strong preference and would like to hear 
votes from all stakeholders. /dev/iommu is slightly better imho for 
two reasons. First, per AMD's presentation at the last KVM Forum they 
implement vIOMMU in hardware and thus need to support user-managed 
domains. An iommu uAPI notation might make more sense moving 
forward. Second, it makes later uAPI naming easier as 'IOASID' can 
always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of 
IOASID_ALLOC_IOASID. :)

Another naming open issue is about IOASID (the software handle for the
ioas) and the associated hardware ID (PASID or substream ID). Jason
thought PASID is defined more from the SVA angle while ARM's convention
sounds clearer from the device p.o.v. Following this direction, SID/SSID
will be used to replace RID/PASID in this RFC (possibly also implying
that the kernel IOASID allocator should be renamed to an SSID
allocator). I don't have a better alternative. If no one objects, I'll
change to this new naming in the next version.

For 2), Jason prefers not to block it if there is no kernel design
reason. If one device is allowed to bind to multiple IOASID fd's, the
main problem is cross-fd IOASID nesting, e.g. having gpa_ioasid created
in fd1 and giova_ioasid created in fd2 and then nesting them together
(and whether any cross-fd notification is required when handling
invalidation etc.). We thought that this just adds complexity while we
are not sure about the value of supporting it (when one fd can already
afford all discussed usages). Therefore this RFC proposes that a device
can be bound to at most one IOASID fd. Does this rationale make sense?

At the other end there was also the thought of whether we should allow
only a single I/O address space per IOASID fd. It was discussed in the
previous thread that #fd's are insufficient to afford the theoretical
1M address spaces per device. But let's revisit this and draw a clear
conclusion on whether this option is viable.

For 3), Jason/Jean both think it's cleaner to carry device info in the 
uAPI. Actually this was one option we developed in earlier internal
versions of this RFC. Later on we changed it to the current way based
on a misinterpretation of previous discussion. Thinking about it more,
we will adopt this suggestion in the next version, for both efficiency
(I/O page fault is already a long path) and security reasons (some
faults are unrecoverable, thus the faulting device must be
identified/isolated).

This implies that VFIO_BIND_IOASID_FD will be extended to allow the
user to specify a device label. This label will be recorded in
/dev/iommu to serve per-device invalidation requests from, and report
per-device fault data to, the user. In addition, the vPASID (if
provided by the user) will also be recorded in /dev/iommu so that
vPASID<->pPASID conversion is conducted properly, e.g. an invalidation
request from the user carries a vPASID which must be converted into a
pPASID before calling the iommu driver, and vice versa for raw fault
data which carries a pPASID while the user expects a vPASID.
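
A minimal sketch of what this translation could look like inside
/dev/iommu (all names below are hypothetical, purely to illustrate
the direction):

	/* hypothetical record created per (device label, vPASID) at attach */
	struct iommu_pasid_xlate {
		u32	dev_label;	/* user-provided device label */
		u32	vpasid;		/* PASID as seen by the guest */
		u32	ppasid;		/* PASID actually programmed in HW */
	};

	/* user -> kernel: invalidation request carries (dev_label, vPASID) */
	ppasid = vpasid_to_ppasid(ictx, req.dev_label, req.vpasid);	// hypothetical
	/* ... then call the iommu driver with the physical PASID ... */

	/* kernel -> user: raw fault data carries a pPASID */
	fault.vpasid = ppasid_to_vpasid(ictx, dev_label, raw_fault.ppasid);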

For 4), there are two options for specifying the IOASID attributes:

    In this RFC, an IOASID has no attribute before it's attached to any
    device. After device attach, the user queries capability/format info
    about the IOMMU which the device belongs to, and then calls
    different ioctl commands to set the attributes for an IOASID (e.g.
    map/unmap, bind/unbind user pgtable, nesting, etc.). This follows
    how the underlying iommu-layer API is designed: a domain reports
    capability/format info and serves iommu ops only after it's attached 
    to a device.

    Jason suggests having the user specify all attributes about how an
    IOASID is expected to work when creating this IOASID. This requires
    /dev/iommu to provide capability/format info once a device is bound
    to the ioasid fd (before creating any IOASID). In concept this
    should work, since given a device we can always find its IOMMU. The
    only gap is the aforementioned one: the current iommu API is
    designed per-domain instead of per-device. A hypothetical sketch of
    this option is shown below.

It seems that to close this design open issue we have to touch the kAPI
design, and Joerg's input is highly appreciated here.
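
To make the contrast concrete, here is a hypothetical sketch of the
second option at the uAPI level (IOASID_ALLOC_EX and its fields are
invented for illustration; they are not part of this proposal):

	/* device is bound first, so per-IOMMU capability/format info exists */
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	/* all attributes are fixed at creation time */
	alloc_data = {
		.parent	= gpa_ioasid;		// nesting parent
		.type	= IOASID_USER_PGTABLE;	// hypothetical: bind, not map/unmap
		.format	= info.pgtable_format;	// hypothetical: from GET_INFO
	};
	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC_EX, &alloc_data);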

For 5), I'd expect Alex to chime in. Per my understanding it looks
like the original purpose of this protocol is not about the I/O address
space. It's for KVM to know whether any device is assigned to this VM
and then do something special (e.g. posted interrupts, EPT cache
attributes, etc.). Because KVM deduces some policy based on the fact
that a device is assigned, it needs to hold a reference to the related
vfio group. This part is irrelevant to this RFC.

But ARM's VMID usage is related to the I/O address space and thus needs
some consideration. Another strange thing is about PPC. It looks like
it also leverages this protocol to do iommu group attach:
kvm_spapr_tce_attach_iommu_group. I don't know why it's done through
KVM instead of VFIO uAPI in the first place.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:10     ` Lu Baolu
@ 2021-06-01  7:15       ` Shenming Lu
  -1 siblings, 0 replies; 518+ messages in thread
From: Shenming Lu @ 2021-06-01  7:15 UTC (permalink / raw)
  To: Lu Baolu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/1 13:10, Lu Baolu wrote:
> Hi Shenming,
> 
> On 6/1/21 12:31 PM, Shenming Lu wrote:
>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>> etc.) are expected to use this interface instead of creating their own logic to
>>> isolate untrusted device DMAs initiated by userspace.
>>>
>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>> made on this uAPI.
>>>
>>> It's based on a lengthy discussion starting from here:
>>>     https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>
>>> It ends up to be a long writing due to many things to be summarized and
>>> non-trivial effort required to connect them into a complete proposal.
>>> Hope it provides a clean base to converge.
>>>
>>
>> [..]
>>
>>>
>>> /*
>>>    * Page fault report and response
>>>    *
>>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>    * the user and an ioctl to complete the fault.
>>>    *
>>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>    */
>>
>> Hi,
>>
>> It seems that the ioasid has different usage in different situation, it could
>> be directly used in the physical routing, or just a virtual handle that indicates
>> a page table or a vPASID table (such as the GPA address space, in the simple
>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>> Substream ID), right?
>>
>> And Baolu suggested that since one device might consume multiple page tables,
>> it's more reasonable to have one fault handler per page table. By this, do we
>> have to maintain such an ioasid info list in the IOMMU layer?
> 
> As discussed earlier, the I/O page fault and cache invalidation paths
> will have "device labels" so that the information could be easily
> translated and routed.
> 
> So it's likely the per-device fault handler registering API in iommu
> core can be kept, but /dev/ioasid will be grown with a layer to
> translate and propagate I/O page fault information to the right
> consumers.

Yeah, having general preprocessing of the faults in the IOASID layer
seems to be a doable direction. But since there may be more than one
consumer at the same time, who is responsible for registering the
per-device fault handler?

Thanks,
Shenming

> 
> If things evolve in this way, probably the SVA I/O page fault also needs
> to be ported to /dev/ioasid.
> 
>>
>> Then if we add host IOPF support (for the GPA address space) in the future
>> (I have sent a series for this but it aimed for VFIO, I will convert it for
>> IOASID later [1] :-)), how could we find the handler for the received fault
>> event which only contains a Stream ID... Do we also have to maintain a
>> dev(vPASID)->ioasid mapping in the IOMMU layer?
>>
>> [1] https://lore.kernel.org/patchwork/cover/1410223/
> 
> Best regards,
> baolu
> .

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 16:23   ` Jean-Philippe Brucker
@ 2021-06-01  7:50     ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  7:50 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Saturday, May 29, 2021 12:23 AM
> >
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
> 
> Is there an advantage to moving software nesting into the kernel?
> We could just have the guest do its usual combined map/unmap on the child
> fd
> 

There are at least two intended usages:

1) From previous discussion it looks like PPC's window-based scheme
can be better supported with software nesting: a shared IOVA address
space as the parent (shared by all devices), nested by multiple windows
as the children (per-device). A rough sketch follows below;

2) Some mdev drivers (e.g. kvmgt) may want to do write-protection on
guest data structures (base address programmed into a mediated MMIO
register). The base address is an IOVA while the KVM page-tracking API
is based on GPA. Nesting allows finding the GPA corresponding to an
IOVA.
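
A rough sketch of 1) using this proposal's ioctls (the per-device
window layout is made up for illustration):

	/* one shared parent IOVA address space for all devices */
	parent_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	/* per-device windows as software-nested children of the parent */
	win1_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, parent_ioasid);
	win2_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, parent_ioasid);

	at_data = { .ioasid = win1_ioasid };
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	at_data = { .ioasid = win2_ioasid };
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Both levels use the dma mapping protocol, so the kernel can
	  * merge them into per-device shadow I/O page tables.
	  */
	ioctl(ioasid_fd, IOASID_DMA_MAP, &parent_map);	// shared mappings
	ioctl(ioasid_fd, IOASID_DMA_MAP, &win1_map);	// dev1's window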

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 17:35   ` Jason Gunthorpe
@ 2021-06-01  8:10     ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  8:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 1:36 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
> 
> Why? A SW emulation could do this synchronization during invalidation
> processing if invalidation contained an IOVA range.

In this proposal we differentiate between host-managed and user-
managed I/O page tables. If host-managed, the user is expected to use
the map/unmap cmds explicitly upon any change required on the page
table. If user-managed, the user first binds its page table to the
IOMMU and then uses the invalidation cmd to flush the iotlb when
necessary (e.g. typically not required when changing a PTE from
non-present to present).

We expect the user to use map+unmap and bind+invalidate respectively
instead of mixing them together. Following this policy, map+unmap must
be used at both levels for software nesting, so changes at either level
are captured in time to synchronize the shadow mapping.
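
Conceptually the merge the kernel re-runs on every map/unmap at either
level looks like this (helper names below are invented, not existing
iommu-layer API):

	/*
	 * child:  GIOVA -> GPA  (user maps via IOASID_DMA_MAP on the child)
	 * parent: GPA   -> HVA  (user maps via IOASID_DMA_MAP on the parent)
	 * Any map/unmap at either level replays this for the affected
	 * range, keeping the shadow GIOVA -> HPA mapping in sync.
	 */
	gpa = child_lookup(giova_ioasid, giova);	// hypothetical
	hva = parent_lookup(gpa_ioasid, gpa);		// hypothetical
	shadow_map(giova_ioasid, giova, pin_and_translate(hva), size);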

> 
> I think this document would be stronger to include some "Rational"
> statements in key places
> 

Sure. I tried to provide rationale as much as possible but sometimes 
it's lost in a complex context like this. :)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 19:58   ` Jason Gunthorpe
@ 2021-06-01  8:38     ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  8:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 3:59 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > 	ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm here. Per earlier discussion we were left
with the impression that you want VFIO to be a pure device driver,
with container/group used only for legacy applications. From this
comment, are you suggesting that VFIO can still keep the container/
group concepts and the user just deprecates the use of the vfio iommu
uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a simple
policy that an IOASID will reject a cmd if a partially-attached group
exists)?

> 
> 
> > Three types of IOASIDs are considered:
> >
> > 	gpa_ioasid[1...N]: 	for GPA address space
> > 	giova_ioasid[1...N]:	for guest IOVA address space
> > 	gva_ioasid[1...N]:	for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > 	/* Bind device to IOASID fd */
> > 	device_fd = open("/dev/vfio/devices/dev1", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* Attach device to IOASID */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0;		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
> 
> eg
> 
>  	device2_fd = open("/dev/vfio/devices/dev1", mode);
>  	ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>  	ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)

> 
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu need to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* pre-register the virtual address range for accounting */
> > 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> > 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > 	/* Attach dev1 and dev2 to gpa_ioasid */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0; 		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > 	/* After boot, guest enables an GIOVA space for dev2 */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> > 	/* First detach dev2 from previous address space */
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> > 	/* Then attach dev2 to the new address space */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a shadow DMA mapping according to vIOMMU
> > 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> > 	  */
> 
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' means the merged mapping: GIOVA(0x2000) -> HVA (0x40001000)
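
To spell it out with the numbers in the example above:

	GIOVA 0x2000 -> GPA 0x1000        (from the guest vIOMMU page table)
	GPA 0x1000   -> HVA 0x40001000    (from the gpa_ioasid mapping)

so the merged entry userspace programs into giova_ioasid here is
GIOVA 0x2000 -> HVA 0x40001000.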

> 
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000; 	// GIOVA
> > 		.vaddr	= 0x40001000;	// HVA
> 
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

yes

> 
> 
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> > 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> > 	  * to form a shadow mapping.
> > 	  */
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000;	// GIOVA
> > 		.vaddr	= 0x1000;	// GPA
> > 		.size	= 4KB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient

yes.

> 
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= giova_ioasid;
> > 		.addr	= giova_pgtable;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be calle IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
> 
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boots the guest further create a GVA address spaces (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid expose ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > 	/* After boots */
> > 	/* Make GVA space nested on GPA space */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space and specify vPASID */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane that the guest links an I/O page table to
different vpasids on dev1 and dev2. The IOMMU doesn't mandate that
when multiple devices share an I/O page table they must use the same
PASID#.
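
For example, reusing the attach data from above (gpasid2 is just an
illustrative second guest PASID):

	/* dev1 links this I/O page table under gpasid1 */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* dev2 links the same I/O page table under a different guest PASID */
	at_data.user_pasid = gpasid2;
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);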

> 
> > 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > 	  * translation structure through KVM
> > 	  */
> > 	pa_data = {
> > 		.ioasid_fd	= ioasid_fd;
> > 		.ioasid		= gva_ioasid;
> > 		.guest_pasid	= gpasid1;
> > 	};
> > 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Make sense
> 
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

> 
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> >
> > -   Host IOMMU driver receives a page request with raw fault_data {rid,
> >     pasid, addr};
> >
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> >     information registered by IOASID fault handler;
> >
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> which
> >     is saved in ioasid_data->fault_data (used for response);
> >
> > -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
> >     to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and 
bind_pasid_table. I summarized it as open#3 in another mail.

thus the following is skipped...

Thanks
Kevin

> 
> > -   Upon received event, Qemu needs to find the virtual routing information
> >     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> >     multiple, pick a random one. This should be fine since the purpose is to
> >     fix the I/O page table on the guest;
> 
> The device label should fix this
> 
> > -   Qemu finds the pending fault event, converts virtual completion data
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> >     complete the pending fault;
> >
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> >     ioasid_data->fault_data, and then calls iommu api to complete it with
> >     {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > PASID table is put in the GPA space on some platform, thus must be
> updated
> > by the guest. It is treated as another user page table to be bound with the
> > IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which,
> once
> > enabled, requires the guest to invalidate PASID cache for any change on the
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> > 	/* After boots */
> > 	/* Make vPASID space nested on GPA space */
> > 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to pasidtbl_ioasid */
> > 	at_data = { .ioasid = pasidtbl_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind PASID table */
> > 	bind_data = {
> > 		.ioasid	= pasidtbl_ioasid;
> > 		.addr	= gpa_pasid_table;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > 	/* vIOMMU detects a new GVA I/O space created */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space, with gpasid1 */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > 	  * used, the kernel will not update the PASID table. Instead, just
> > 	  * track the bound I/O page table for handling invalidation and
> > 	  * I/O page faults.
> > 	  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benifit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01  8:38     ` Tian, Kevin
  0 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-01  8:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 3:59 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > 	ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm something here. From the earlier discussion we
were under the impression that you want VFIO to be a pure device driver,
with container/group used only by legacy applications. From this
comment, are you suggesting that VFIO can still keep the container/
group concepts, and the user simply deprecates the vfio iommu
uAPI (e.g. VFIO_SET_IOMMU) in favor of /dev/ioasid (which has
a simple policy that an IOASID will reject commands if a
partially-attached group exists)?
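
To make the question concrete, something like below is what I'm asking
about (purely illustrative; it assumes the group fd no longer requires
VFIO_SET_IOMMU before handing out device fds, which is exactly the
behavioral change in question):

	/* container/group kept only for ownership/viability */
	group_fd = open("/dev/vfio/$GROUP", mode);
	device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "dev1");

	/* no VFIO_SET_IOMMU; DMA isolation comes from /dev/ioasid */
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
	at_data = { .ioasid = gpa_ioasid };
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);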

> 
> 
> > Three types of IOASIDs are considered:
> >
> > 	gpa_ioasid[1...N]: 	for GPA address space
> > 	giova_ioasid[1...N]:	for guest IOVA address space
> > 	gva_ioasid[1...N]:	for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > 	/* Bind device to IOASID fd */
> > 	device_fd = open("/dev/vfio/devices/dev1", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* Attach device to IOASID */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0;		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
> 
> eg
> 
>  	device2_fd = open("/dev/vfio/devices/dev1", mode);
>  	ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>  	ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)
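
i.e.:

	device2_fd = open("/dev/vfio/devices/dev2", mode);
	ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);	// same at_data (.ioasid = gpa_ioasid)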

> 
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu need to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* pre-register the virtual address range for accounting */
> > 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> > 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > 	/* Attach dev1 and dev2 to gpa_ioasid */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0; 		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > 	/* After boot, guest enables an GIOVA space for dev2 */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> > 	/* First detach dev2 from previous address space */
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> > 	/* Then attach dev2 to the new address space */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a shadow DMA mapping according to vIOMMU
> > 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> > 	  */
> 
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' means the merged mapping: GIOVA(0x2000) -> HVA (0x40001000)
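
i.e. Qemu walks the vIOMMU mapping (GIOVA 0x2000 -> GPA 0x1000), looks up
its own memory map (GPA 0x1000 -> HVA 0x40001000), and installs the merged
result itself (same example values as below; the 4KB size comes from 5.3):

	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x40001000;	// HVA, already merged by Qemu
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);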

> 
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000; 	// GIOVA
> > 		.vaddr	= 0x40001000;	// HVA
> 
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

yes

> 
> 
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> > 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> > 	  * to form a shadow mapping.
> > 	  */
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000;	// GIOVA
> > 		.vaddr	= 0x1000;	// GPA
> > 		.size	= 4KB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient

yes.
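
Just to illustrate the kernel-side merge (a rough sketch only; the helper
names below are made up and not part of this proposal):

	/* resolve through the child, then the parent mapping */
	gpa = child_lookup(giova_ioasid, 0x2000);	/* -> GPA 0x1000 */
	hva = parent_lookup(gpa_ioasid, gpa);		/* -> HVA 0x40001000 */
	/* pin the backing page and install the merged entry into the
	 * shadow I/O page table, so HW never walks the Qemu mappings
	 */
	pfn = pin_user_page(hva);
	shadow_pgtable_map(giova_ioasid, 0x2000, pfn, 4KB);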

> 
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= giova_ioasid;
> > 		.addr	= giova_pgtable;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be calle IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
> 
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boots the guest further create a GVA address spaces (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid expose ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > 	/* After boots */
> > 	/* Make GVA space nested on GPA space */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space and specify vPASID */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane that the guest links an I/O page table to
different vpasids on dev1 and dev2. The IOMMU doesn't mandate
that when multiple devices share an I/O page table they must use
the same PASID#.
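
e.g. the guest could bind the same I/O page table to dev2 under a
different vPASID (gpasid2 below is just a made-up example value):

	at_data = {
		.ioasid		= gva_ioasid;
		.flag		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid2;
	};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);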

> 
> > 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > 	  * translation structure through KVM
> > 	  */
> > 	pa_data = {
> > 		.ioasid_fd	= ioasid_fd;
> > 		.ioasid		= gva_ioasid;
> > 		.guest_pasid	= gpasid1;
> > 	};
> > 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Make sense
> 
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

> 
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> >
> > -   Host IOMMU driver receives a page request with raw fault_data {rid,
> >     pasid, addr};
> >
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> >     information registered by IOASID fault handler;
> >
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> which
> >     is saved in ioasid_data->fault_data (used for response);
> >
> > -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
> >     to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and 
bind_pasid_table. I summarized it as open#3 in another mail.
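
To make open#3 concrete, the user-visible fault record could then carry
something like below (a sketch only; the field names are illustrative):

	struct ioasid_fault_data {
		__u32 ioasid;
		__u32 device_label;	/* returned at VFIO_BIND_IOASID_FD */
		__u32 pasid;		/* when the label covers a whole RID */
		__u64 addr;
	};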

thus the following is skipped...

Thanks
Kevin

> 
> > -   Upon received event, Qemu needs to find the virtual routing information
> >     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> >     multiple, pick a random one. This should be fine since the purpose is to
> >     fix the I/O page table on the guest;
> 
> The device label should fix this
> 
> > -   Qemu finds the pending fault event, converts virtual completion data
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> >     complete the pending fault;
> >
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> >     ioasid_data->fault_data, and then calls iommu api to complete it with
> >     {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > PASID table is put in the GPA space on some platform, thus must be
> updated
> > by the guest. It is treated as another user page table to be bound with the
> > IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which,
> once
> > enabled, requires the guest to invalidate PASID cache for any change on the
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> > 	/* After boots */
> > 	/* Make vPASID space nested on GPA space */
> > 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to pasidtbl_ioasid */
> > 	at_data = { .ioasid = pasidtbl_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind PASID table */
> > 	bind_data = {
> > 		.ioasid	= pasidtbl_ioasid;
> > 		.addr	= gpa_pasid_table;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > 	/* vIOMMU detects a new GVA I/O space created */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space, with gpasid1 */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > 	  * used, the kernel will not update the PASID table. Instead, just
> > 	  * track the bound I/O page table for handling invalidation and
> > 	  * I/O page faults.
> > 	  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benifit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  6:16                     ` Tian, Kevin
@ 2021-06-01  8:47                       ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  8:47 UTC (permalink / raw)
  To: Tian, Kevin, Lu Baolu, Liu Yi L
  Cc: kvm, Jonathan Corbet, iommu, LKML,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Gunthorpe, David Woodhouse


在 2021/6/1 下午2:16, Tian, Kevin 写道:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 2:07 PM
>>
>> 在 2021/6/1 下午1:42, Tian, Kevin 写道:
>>>> From: Jason Wang
>>>> Sent: Tuesday, June 1, 2021 1:30 PM
>>>>
>>>> 在 2021/6/1 下午1:23, Lu Baolu 写道:
>>>>> Hi Jason W,
>>>>>
>>>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>>>> happens
>>>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>>>> 3) If not, how GET_INFO work?
>>>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>>>> yet. But using ID may give us opportunity to customize the
>> management
>>>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>>> I'm not sure I get this, for nesting cases you can still make the
>>>>>> child an fd.
>>>>>>
>>>>>> And a question still, under what case we need to create multiple
>>>>>> ioasids on a single ioasid fd?
>>>>> One possible situation where multiple IOASIDs per FD could be used is
>>>>> that devices with different underlying IOMMU capabilities are sharing a
>>>>> single FD. In this case, only devices with consistent underlying IOMMU
>>>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>>>> be applied.
>>>>>
>>>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>>>> IOASID FDs" for such case.
>>>> Right, that's exactly my question. The latter seems much more easier to
>>>> be understood and implemented.
>>>>
>>> A simple reason discussed in previous thread - there could be 1M's
>>> I/O address spaces per device while #FD's are precious resource.
>>
>> Is the concern for ulimit or performance? Note that we had
>>
>> #define NR_OPEN_MAX ~0U
>>
>> And with the fd semantic, you can do a lot of other stuffs: close on
>> exec, passing via SCM_RIGHTS.
> yes, fd has its merits.
>
>> For the case of 1M, I would like to know what's the use case for a
>> single process to handle 1M+ address spaces?
> This single process is Qemu with an assigned device. Within the guest
> there could be many guest processes. Though in reality I didn't see
> such 1M processes on a single device, better not restrict it in uAPI?


Sorry I don't get here.

We can open up to ~0U file descriptors, so I don't see why we need to
restrict it in the uAPI.

Thanks


>
>>
>>> So this RFC treats fd as a container of address spaces which is each
>>> tagged by an IOASID.
>>
>> If the container and address space is 1:1 then the container seems useless.
>>
> yes, 1:1 then container is useless. But here it's assumed 1:M then
> even a single fd is sufficient for all intended usages.
>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01  8:47                       ` Jason Wang
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-01  8:47 UTC (permalink / raw)
  To: Tian, Kevin, Lu Baolu, Liu Yi L
  Cc: kvm, Jonathan Corbet,
	Alex Williamson (alex.williamson@redhat.com),
	LKML, iommu, Jason Gunthorpe, David Woodhouse


在 2021/6/1 下午2:16, Tian, Kevin 写道:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 2:07 PM
>>
>> 在 2021/6/1 下午1:42, Tian, Kevin 写道:
>>>> From: Jason Wang
>>>> Sent: Tuesday, June 1, 2021 1:30 PM
>>>>
>>>> 在 2021/6/1 下午1:23, Lu Baolu 写道:
>>>>> Hi Jason W,
>>>>>
>>>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>>>> happens
>>>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>>>> 3) If not, how GET_INFO work?
>>>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>>>> yet. But using ID may give us opportunity to customize the
>> management
>>>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>>> I'm not sure I get this, for nesting cases you can still make the
>>>>>> child an fd.
>>>>>>
>>>>>> And a question still, under what case we need to create multiple
>>>>>> ioasids on a single ioasid fd?
>>>>> One possible situation where multiple IOASIDs per FD could be used is
>>>>> that devices with different underlying IOMMU capabilities are sharing a
>>>>> single FD. In this case, only devices with consistent underlying IOMMU
>>>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>>>> be applied.
>>>>>
>>>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>>>> IOASID FDs" for such case.
>>>> Right, that's exactly my question. The latter seems much more easier to
>>>> be understood and implemented.
>>>>
>>> A simple reason discussed in previous thread - there could be 1M's
>>> I/O address spaces per device while #FD's are precious resource.
>>
>> Is the concern for ulimit or performance? Note that we had
>>
>> #define NR_OPEN_MAX ~0U
>>
>> And with the fd semantic, you can do a lot of other stuffs: close on
>> exec, passing via SCM_RIGHTS.
> yes, fd has its merits.
>
>> For the case of 1M, I would like to know what's the use case for a
>> single process to handle 1M+ address spaces?
> This single process is Qemu with an assigned device. Within the guest
> there could be many guest processes. Though in reality I didn't see
> such 1M processes on a single device, better not restrict it in uAPI?


Sorry I don't get here.

We can open up to ~0U file descriptors, so I don't see why we need to
restrict it in the uAPI.

Thanks


>
>>
>>> So this RFC treats fd as a container of address spaces which is each
>>> tagged by an IOASID.
>>
>> If the container and address space is 1:1 then the container seems useless.
>>
> yes, 1:1 then container is useless. But here it's assumed 1:M then
> even a single fd is sufficient for all intended usages.
>
> Thanks
> Kevin

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36   ` Jason Gunthorpe
@ 2021-06-01 11:09     ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01 11:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: baolu.lu, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

Hi Jason,

On 2021/5/29 7:36, Jason Gunthorpe wrote:
>> /*
>>    * Bind an user-managed I/O page table with the IOMMU
>>    *
>>    * Because user page table is untrusted, IOASID nesting must be enabled
>>    * for this ioasid so the kernel can enforce its DMA isolation policy
>>    * through the parent ioasid.
>>    *
>>    * Pgtable binding protocol is different from DMA mapping. The latter
>>    * has the I/O page table constructed by the kernel and updated
>>    * according to user MAP/UNMAP commands. With pgtable binding the
>>    * whole page table is created and updated by userspace, thus different
>>    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>>    *
>>    * Because the page table is directly walked by the IOMMU, the user
>>    * must  use a format compatible to the underlying hardware. It can
>>    * check the format information through IOASID_GET_INFO.
>>    *
>>    * The page table is bound to the IOMMU according to the routing
>>    * information of each attached device under the specified IOASID. The
>>    * routing information (RID and optional PASID) is registered when a
>>    * device is attached to this IOASID through VFIO uAPI.
>>    *
>>    * Input parameters:
>>    *	- child_ioasid;
>>    *	- address of the user page table;
>>    *	- formats (vendor, address_width, etc.);
>>    *
>>    * Return: 0 on success, -errno on failure.
>>    */
>> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
>> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?
> 

Thinking of the required page table format, perhaps we should shed more
light on the page table of an IOASID. So far, an IOASID might represent
one of the following page tables (might be more):

  1) an IOMMU format page table (a.k.a. iommu_domain)
  2) a user application CPU page table (SVA for example)
  3) a KVM EPT (future option)
  4) a VM guest managed page table (nesting mode)

This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond? If so, it seems that we need some in-kernel helpers and
uAPIs to support pre-installing a page table into an IOASID. From this point
of view an IOASID is actually not just a variant of iommu_domain, but an
I/O page table representation in a broader sense.
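
Just to illustrate the distinction (purely hypothetical, not proposed
uAPI), this could eventually surface as a page table "type" selected when
the IOASID is created:

	enum ioasid_pgtable_type {
		IOASID_PGTABLE_KERNEL,		/* 1) iommu_domain, kernel-managed */
		IOASID_PGTABLE_CPU_SVA,		/* 2) user CPU page table (SVA) */
		IOASID_PGTABLE_KVM_EPT,		/* 3) KVM EPT (future option) */
		IOASID_PGTABLE_USER_NESTED,	/* 4) guest-managed, nested */
	};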

Best regards,
baolu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 11:09     ` Lu Baolu
  0 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01 11:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

Hi Jason,

On 2021/5/29 7:36, Jason Gunthorpe wrote:
>> /*
>>    * Bind an user-managed I/O page table with the IOMMU
>>    *
>>    * Because user page table is untrusted, IOASID nesting must be enabled
>>    * for this ioasid so the kernel can enforce its DMA isolation policy
>>    * through the parent ioasid.
>>    *
>>    * Pgtable binding protocol is different from DMA mapping. The latter
>>    * has the I/O page table constructed by the kernel and updated
>>    * according to user MAP/UNMAP commands. With pgtable binding the
>>    * whole page table is created and updated by userspace, thus different
>>    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>>    *
>>    * Because the page table is directly walked by the IOMMU, the user
>>    * must  use a format compatible to the underlying hardware. It can
>>    * check the format information through IOASID_GET_INFO.
>>    *
>>    * The page table is bound to the IOMMU according to the routing
>>    * information of each attached device under the specified IOASID. The
>>    * routing information (RID and optional PASID) is registered when a
>>    * device is attached to this IOASID through VFIO uAPI.
>>    *
>>    * Input parameters:
>>    *	- child_ioasid;
>>    *	- address of the user page table;
>>    *	- formats (vendor, address_width, etc.);
>>    *
>>    * Return: 0 on success, -errno on failure.
>>    */
>> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
>> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?
> 

Thinking of the required page table format, perhaps we should shed more
light on the page table of an IOASID. So far, an IOASID might represent
one of the following page tables (might be more):

  1) an IOMMU format page table (a.k.a. iommu_domain)
  2) a user application CPU page table (SVA for example)
  3) a KVM EPT (future option)
  4) a VM guest managed page table (nesting mode)

This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond? If so, it seems that we need some in-kernel helpers and
uAPIs to support pre-installing a page table into an IOASID. From this point
of view an IOASID is actually not just a variant of iommu_domain, but an
I/O page table representation in a broader sense.

Best regards,
baolu
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 18:12     ` Jason Gunthorpe
@ 2021-06-01 12:04       ` Parav Pandit
  -1 siblings, 0 replies; 518+ messages in thread
From: Parav Pandit @ 2021-06-01 12:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy



> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, May 31, 2021 11:43 PM
> 
> On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
> 
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example few years back such system call mpin() thought was proposed
> in [1].
> 
> Reference counting of the overall pins are required
> 
> So when a pinned pages is incorporated into an IOASID page table in a later
> IOCTL it means it cannot be unpinned while the IOASID page table is using it.
OK, but can't it use the same refcount that the mmu uses?

> 
> This is some trick to organize the pinning into groups and then refcount each
> group, thus avoiding needing per-page refcounts.
Pinned page refcount is already maintained by the mmu without ioasid, isn't it?

> 
> The data structure would be an interval tree of pins in general
> 
> The ioasid itself would have an interval tree of its own mappings, each entry
> in this tree would reference count against an element in the above tree
> 
> Then the ioasid's interval tree would be mapped into a page table tree in HW
> format.
Does it mean that in the simple use case [1], a second-level copy of the page table is maintained on the IOMMU side via the map interface?
I hope not. It should use the same one that the mmu uses, right?

[1] one SIOV/ADI device assigned with one PASID and mapped in guest VM

> 
> The redundant storages are needed to keep track of the refencing and the
> CPU page table values for later unpinning.
> 
> Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 12:04       ` Parav Pandit
  0 siblings, 0 replies; 518+ messages in thread
From: Parav Pandit @ 2021-06-01 12:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang



> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, May 31, 2021 11:43 PM
> 
> On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
> 
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example few years back such system call mpin() thought was proposed
> in [1].
> 
> Reference counting of the overall pins are required
> 
> So when a pinned pages is incorporated into an IOASID page table in a later
> IOCTL it means it cannot be unpinned while the IOASID page table is using it.
OK, but can't it use the same refcount that the mmu uses?

> 
> This is some trick to organize the pinning into groups and then refcount each
> group, thus avoiding needing per-page refcounts.
Pinned page refcount is already maintained by the mmu without ioasid, isn't it?

> 
> The data structure would be an interval tree of pins in general
> 
> The ioasid itself would have an interval tree of its own mappings, each entry
> in this tree would reference count against an element in the above tree
> 
> Then the ioasid's interval tree would be mapped into a page table tree in HW
> format.
Does it mean that in the simple use case [1], a second-level copy of the page table is maintained on the IOMMU side via the map interface?
I hope not. It should use the same one that the mmu uses, right?

[1] one SIOV/ADI device assigned with one PASID and mapped in guest VM

> 
> The redundant storages are needed to keep track of the refencing and the
> CPU page table values for later unpinning.
> 
> Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  7:15       ` Shenming Lu
@ 2021-06-01 12:30         ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01 12:30 UTC (permalink / raw)
  To: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: baolu.lu, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu,
	Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/1 15:15, Shenming Lu wrote:
> On 2021/6/1 13:10, Lu Baolu wrote:
>> Hi Shenming,
>>
>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>> isolate untrusted device DMAs initiated by userspace.
>>>>
>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>> made on this uAPI.
>>>>
>>>> It's based on a lengthy discussion starting from here:
>>>>      https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>>
>>>> It ends up to be a long writing due to many things to be summarized and
>>>> non-trivial effort required to connect them into a complete proposal.
>>>> Hope it provides a clean base to converge.
>>>>
>>> [..]
>>>
>>>> /*
>>>>     * Page fault report and response
>>>>     *
>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>     * the user and an ioctl to complete the fault.
>>>>     *
>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>     */
>>> Hi,
>>>
>>> It seems that the ioasid has different usage in different situation, it could
>>> be directly used in the physical routing, or just a virtual handle that indicates
>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>> Substream ID), right?
>>>
>>> And Baolu suggested that since one device might consume multiple page tables,
>>> it's more reasonable to have one fault handler per page table. By this, do we
>>> have to maintain such an ioasid info list in the IOMMU layer?
>> As discussed earlier, the I/O page fault and cache invalidation paths
>> will have "device labels" so that the information could be easily
>> translated and routed.
>>
>> So it's likely the per-device fault handler registering API in iommu
>> core can be kept, but /dev/ioasid will be grown with a layer to
>> translate and propagate I/O page fault information to the right
>> consumers.
> Yeah, having a general preprocessing of the faults in IOASID seems to be
> a doable direction. But since there may be more than one consumer at the
> same time, who is responsible for registering the per-device fault handler?

The drivers register per-page-table fault handlers to /dev/ioasid, which
will then register itself with the iommu core to listen for and route the
per-device I/O page faults. This is just a top-level thought. I haven't
gone through the details yet; we need to wait and see what /dev/ioasid
finally looks like.
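
At a very high level, something like below (the names are made up, just
to show the layering; /dev/ioasid owns the per-page-table handlers and a
thin shim in the iommu core feeds it the per-device faults):

	/* registered by /dev/ioasid, one per bound I/O page table */
	int ioasid_register_fault_handler(u32 ioasid,
			int (*handler)(u32 ioasid, u32 device_label,
				       u32 pasid, u64 addr, void *data),
			void *data);

	/* /dev/ioasid in turn hooks the existing per-device path in the
	 * iommu core and routes each fault to the right handler above
	 */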

Best regards,
baolu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 12:30         ` Lu Baolu
  0 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-01 12:30 UTC (permalink / raw)
  To: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, Jonathan Corbet,
	Kirti Wankhede, David Gibson, wanghaibin.wang, Robin Murphy

On 2021/6/1 15:15, Shenming Lu wrote:
> On 2021/6/1 13:10, Lu Baolu wrote:
>> Hi Shenming,
>>
>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>> isolate untrusted device DMAs initiated by userspace.
>>>>
>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>> made on this uAPI.
>>>>
>>>> It's based on a lengthy discussion starting from here:
>>>>      https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>>
>>>> It ends up to be a long writing due to many things to be summarized and
>>>> non-trivial effort required to connect them into a complete proposal.
>>>> Hope it provides a clean base to converge.
>>>>
>>> [..]
>>>
>>>> /*
>>>>     * Page fault report and response
>>>>     *
>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>     * the user and an ioctl to complete the fault.
>>>>     *
>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>     */
>>> Hi,
>>>
>>> It seems that the ioasid has different usage in different situation, it could
>>> be directly used in the physical routing, or just a virtual handle that indicates
>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>> Substream ID), right?
>>>
>>> And Baolu suggested that since one device might consume multiple page tables,
>>> it's more reasonable to have one fault handler per page table. By this, do we
>>> have to maintain such an ioasid info list in the IOMMU layer?
>> As discussed earlier, the I/O page fault and cache invalidation paths
>> will have "device labels" so that the information could be easily
>> translated and routed.
>>
>> So it's likely the per-device fault handler registering API in iommu
>> core can be kept, but /dev/ioasid will be grown with a layer to
>> translate and propagate I/O page fault information to the right
>> consumers.
> Yeah, having a general preprocessing of the faults in IOASID seems to be
> a doable direction. But since there may be more than one consumer at the
> same time, who is responsible for registering the per-device fault handler?

The drivers register per-page-table fault handlers to /dev/ioasid, which
will then register itself with the iommu core to listen for and route the
per-device I/O page faults. This is just a top-level thought. I haven't
gone through the details yet; we need to wait and see what /dev/ioasid
finally looks like.

Best regards,
baolu
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 12:30         ` Lu Baolu
@ 2021-06-01 13:10           ` Shenming Lu
  -1 siblings, 0 replies; 518+ messages in thread
From: Shenming Lu @ 2021-06-01 13:10 UTC (permalink / raw)
  To: Lu Baolu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Jean-Philippe Brucker
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/1 20:30, Lu Baolu wrote:
> On 2021/6/1 15:15, Shenming Lu wrote:
>> On 2021/6/1 13:10, Lu Baolu wrote:
>>> Hi Shenming,
>>>
>>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>>> isolate untrusted device DMAs initiated by userspace.
>>>>>
>>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>>> made on this uAPI.
>>>>>
>>>>> It's based on a lengthy discussion starting from here:
>>>>>      https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>>>
>>>>> It ends up to be a long writing due to many things to be summarized and
>>>>> non-trivial effort required to connect them into a complete proposal.
>>>>> Hope it provides a clean base to converge.
>>>>>
>>>> [..]
>>>>
>>>>> /*
>>>>>     * Page fault report and response
>>>>>     *
>>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>>     * the user and an ioctl to complete the fault.
>>>>>     *
>>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>>     */
>>>> Hi,
>>>>
>>>> It seems that the ioasid has different usage in different situation, it could
>>>> be directly used in the physical routing, or just a virtual handle that indicates
>>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>>> Substream ID), right?
>>>>
>>>> And Baolu suggested that since one device might consume multiple page tables,
>>>> it's more reasonable to have one fault handler per page table. By this, do we
>>>> have to maintain such an ioasid info list in the IOMMU layer?
>>> As discussed earlier, the I/O page fault and cache invalidation paths
>>> will have "device labels" so that the information could be easily
>>> translated and routed.
>>>
>>> So it's likely the per-device fault handler registering API in iommu
>>> core can be kept, but /dev/ioasid will be grown with a layer to
>>> translate and propagate I/O page fault information to the right
>>> consumers.
>> Yeah, having a general preprocessing of the faults in IOASID seems to be
>> a doable direction. But since there may be more than one consumer at the
>> same time, who is responsible for registering the per-device fault handler?
> 
> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. This is just a top level thought. Haven't gone
> through the details yet. Need to wait and see what /dev/ioasid finally
> looks like.

OK. And it needs to be confirmed by Jean since we might migrate the code from
io-pgfault.c to IOASID... Anyway, finalize /dev/ioasid first.  Thanks,

Shenming

> 
> Best regards,
> baolu
> .

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 13:10           ` Shenming Lu
  0 siblings, 0 replies; 518+ messages in thread
From: Shenming Lu @ 2021-06-01 13:10 UTC (permalink / raw)
  To: Lu Baolu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Jean-Philippe Brucker
  Cc: Jiang, Dave, Raj, Ashok, Jonathan Corbet, Kirti Wankhede,
	wanghaibin.wang, David Gibson, Robin Murphy

On 2021/6/1 20:30, Lu Baolu wrote:
> On 2021/6/1 15:15, Shenming Lu wrote:
>> On 2021/6/1 13:10, Lu Baolu wrote:
>>> Hi Shenming,
>>>
>>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>>> isolate untrusted device DMAs initiated by userspace.
>>>>>
>>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>>> made on this uAPI.
>>>>>
>>>>> It's based on a lengthy discussion starting from here:
>>>>>      https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>>>
>>>>> It ends up to be a long writing due to many things to be summarized and
>>>>> non-trivial effort required to connect them into a complete proposal.
>>>>> Hope it provides a clean base to converge.
>>>>>
>>>> [..]
>>>>
>>>>> /*
>>>>>     * Page fault report and response
>>>>>     *
>>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>>     * the user and an ioctl to complete the fault.
>>>>>     *
>>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>>     */
>>>> Hi,
>>>>
>>>> It seems that the ioasid has different usage in different situation, it could
>>>> be directly used in the physical routing, or just a virtual handle that indicates
>>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>>> Substream ID), right?
>>>>
>>>> And Baolu suggested that since one device might consume multiple page tables,
>>>> it's more reasonable to have one fault handler per page table. By this, do we
>>>> have to maintain such an ioasid info list in the IOMMU layer?
>>> As discussed earlier, the I/O page fault and cache invalidation paths
>>> will have "device labels" so that the information could be easily
>>> translated and routed.
>>>
>>> So it's likely the per-device fault handler registering API in iommu
>>> core can be kept, but /dev/ioasid will be grown with a layer to
>>> translate and propagate I/O page fault information to the right
>>> consumers.
>> Yeah, having a general preprocessing of the faults in IOASID seems to be
>> a doable direction. But since there may be more than one consumer at the
>> same time, who is responsible for registering the per-device fault handler?
> 
> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. This is just a top level thought. Haven't gone
> through the details yet. Need to wait and see what /dev/ioasid finally
> looks like.

OK. And it needs to be confirmed by Jean since we might migrate the code from
io-pgfault.c to IOASID... Anyway, finalize /dev/ioasid first.  Thanks,

Shenming

> 
> Best regards,
> baolu
> .
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  3:08         ` Lu Baolu
@ 2021-06-01 17:24           ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:24 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Liu Yi L, Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj,
	Ashok, kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy, David Gibson

On Tue, Jun 01, 2021 at 11:08:53AM +0800, Lu Baolu wrote:
> On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
> > > > device bind should fail if the device somehow isn't compatible with
> > > > the scheme the user is tring to use.
> > > yeah, I guess you mean to fail the device attach when the IOASID is a
> > > nesting IOASID but the device is behind an iommu without nesting support.
> > > right?
> > Right..
> 
> Just want to confirm...
> 
> Does this mean that we only support hardware nesting and don't want to
> have soft nesting (shadowed page table in kernel) in IOASID?

No, the uAPI presents a contract; if the kernel can fulfill the
contract then it should be supported.

If you want SW nesting then the kernel has to have the SW support for
it, or fail.

At least for the purposes of the document I wouldn't delve too much
deeper into that question.

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 17:24           ` Jason Gunthorpe
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:24 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, Jiang, Dave, Raj, Ashok, kvm, Jean-Philippe Brucker,
	Robin Murphy, Jason Wang, Jonathan Corbet, LKML, iommu,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	David Woodhouse, David Gibson

On Tue, Jun 01, 2021 at 11:08:53AM +0800, Lu Baolu wrote:
> On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
> > > > device bind should fail if the device somehow isn't compatible with
> > > > the scheme the user is tring to use.
> > > yeah, I guess you mean to fail the device attach when the IOASID is a
> > > nesting IOASID but the device is behind an iommu without nesting support.
> > > right?
> > Right..
> 
> Just want to confirm...
> 
> Does this mean that we only support hardware nesting and don't want to
> have soft nesting (shadowed page table in kernel) in IOASID?

No, the uAPI presents a contract; if the kernel can fulfill the
contract then it should be supported.

If you want SW nesting then the kernel has to have the SW support for
it, or fail.

At least for the purposes of the document I wouldn't delve too much
deeper into that question.

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 11:09     ` Lu Baolu
@ 2021-06-01 17:26       ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:26 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:

> This version only covers 1) and 4). Do you think we need to support 2),
> 3) and beyond? 

Yes, absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all
HW-specific details of those formats.

> If so, it seems that we need some in-kernel helpers and uAPIs to
> support pre-installing a page table to IOASID. 

Not sure what this means..

> From this point of view an IOASID is actually not just a variant of
> iommu_domain, but an I/O page table representation in a broader
> sense.

Yes, and things need to evolve in a staged way. The ioctl API should
have room for this growth, but you need to start out with something
constrained enough to actually implement, then figure out how to grow
from there.
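
A concrete way to leave that room (a sketch only; the field names are
illustrative, but it follows the usual argsz/flags convention):

	struct ioasid_alloc_data {
		__u32 argsz;		/* userspace sets sizeof(); kernel checks */
		__u32 flags;		/* 0 for now; format selectors etc. later */
		__u32 parent_ioasid;	/* for nesting, when requested */
		__u32 pad;
		/* new fields appended here as formats/capabilities grow */
	};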

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 17:26       ` Jason Gunthorpe
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:26 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Jean-Philippe Brucker, Tian, Kevin,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, Jiang, Dave, David Gibson,
	David Woodhouse, Jason Wang

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:

> This version only covers 1) and 4). Do you think we need to support 2),
> 3) and beyond? 

Yes, absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all
HW-specific details of those formats.

> If so, it seems that we need some in-kernel helpers and uAPIs to
> support pre-installing a page table to IOASID. 

Not sure what this means..

> From this point of view an IOASID is actually not just a variant of
> iommu_domain, but an I/O page table representation in a broader
> sense.

Yes, and things need to evolve in a staged way. The ioctl API should
have room for this growth, but you need to start out with something
constrained enough to actually implement, then figure out how to grow
from there.

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  6:07                   ` Jason Wang
@ 2021-06-01 17:29                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L,
	Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, David Woodhouse

On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:

> For the case of 1M, I would like to know what's the use case for a single
> process to handle 1M+ address spaces?

For some scenarios every guest PASID will require an IOASID ID #, so
there is a large enough demand that FDs alone are not a good fit.

Further, there are global container-wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at startup.
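
e.g. with a single fd the device bindings are done once and shared by
everything allocated afterwards (using only the calls from this RFC):

	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioasid_a = ioctl(ioasid_fd, IOASID_ALLOC);	/* e.g. GPA space */
	ioasid_b = ioctl(ioasid_fd, IOASID_ALLOC);	/* e.g. backing a guest PASID */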

> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
> 
> If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-01 17:29                     ` Jason Gunthorpe
  0 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, kvm, Jonathan Corbet, LKML, iommu,
	Alex Williamson (alex.williamson@redhat.com),
	David Woodhouse

On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:

> For the case of 1M, I would like to know what's the use case for a single
> process to handle 1M+ address spaces?

For some scenarios every guest PASID will require an IOASID ID #, so
there is a large enough demand that FDs alone are not a good fit.

Further, there are global container-wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at startup.

> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
> 
> If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-06-01 17:30   ` Parav Pandit
  -1 siblings, 0 replies; 518+ messages in thread
From: Parav Pandit @ 2021-06-01 17:30 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, May 27, 2021 1:28 PM

> 5.6. I/O page fault
> +++++++++++++++
> 
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid,
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
>     to the shared ring buffer and triggers eventfd to userspace;
> 
> -   Upon received event, Qemu needs to find the virtual routing information
>     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
>     multiple, pick a random one. This should be fine since the purpose is to
>     fix the I/O page table on the guest;
> 
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>     carrying the virtual fault data (v_rid, v_pasid, addr);
> 
Why does it have to be through the vIOMMU?
For a VFIO PCI device, have you considered reusing the same PRI interface to inject the page fault into the guest?
This eliminates the need for any new v_rid.
It would also route the page fault request and response through the right vfio device.

> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>     then sends a page response with virtual completion data (v_rid, v_pasid,
>     response_code) to vIOMMU;
> 
What about fixing up the fault in the guest's MMU page table as well?
Or did you mean both when you said "updates the I/O page table" above?

It is unclear to me whether a single nested page table is maintained or two (one referenced by CR3 and another for the IOMMU).
Can you please clarify?

> -   Qemu finds the pending fault event, converts virtual completion data
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
>     complete the pending fault;
> 
For a VFIO PCI device, once a virtual PRI request/response interface is done, it can be a generic interface shared among multiple vIOMMUs.

> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};
>
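
To make the flow above concrete, here is a rough userspace-side sketch of
the conversion steps being discussed. Every structure, helper and ioctl
name below (user_fault_data, vdev_lookup_routing, viommu_inject_page_request,
IOASID_PAGE_RESPONSE, ...) is a hypothetical illustration, not part of the
proposed uAPI:

    /* Hypothetical illustration only -- none of these names are defined
     * by the proposal. */
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct user_fault_data {        /* entry read from the shared ring buffer */
        uint32_t ioasid;
        uint64_t addr;
    };

    struct ioasid_fault_response {  /* argument of a completion ioctl */
        uint32_t ioasid;
        uint32_t response_code;
    };

    static void handle_io_page_fault(int ioasid_fd, struct user_fault_data *ev)
    {
        uint32_t v_rid, v_pasid;

        /* Find one device attached to the faulting ioasid and its virtual
         * routing information kept by the vIOMMU model. */
        vdev_lookup_routing(ev->ioasid, &v_rid, &v_pasid);

        /* Inject a virtual page request (v_rid, v_pasid, addr) into the
         * guest through the emulated vIOMMU. */
        viommu_inject_page_request(v_rid, v_pasid, ev->addr);

        /* ... later, when the guest's page response arrives ... */
        struct ioasid_fault_response resp = {
            .ioasid        = ev->ioasid,
            .response_code = 0,     /* success */
        };
        ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);
    }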

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  8:47                       ` Jason Wang
@ 2021-06-01 17:31                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L, kvm, Jonathan Corbet, iommu,
	LKML, Alex Williamson (alex.williamson@redhat.com),
	David Woodhouse

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
 
> We can open up to ~0U file descriptors, I don't see why we need to restrict
> it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select() don't work at all
anymore, and IIRC there are more complications.

A huge number of FDs for typical usages should be avoided.
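
(For background on the select() point: fd_set is a fixed-size bitmap of
FD_SETSIZE descriptors -- typically 1024 -- so any fd numbered at or above
that limit cannot be used with select() at all. A minimal illustration:)

    #include <stdio.h>
    #include <sys/select.h>

    int main(void)
    {
        /* fd_set only covers descriptors 0..FD_SETSIZE-1; FD_SET() on a
         * larger fd is undefined behaviour, so a process holding very
         * high-numbered FDs cannot select() on them. */
        printf("FD_SETSIZE = %d\n", FD_SETSIZE);
        return 0;
    }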

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 12:30         ` Lu Baolu
@ 2021-06-01 17:33           ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:33 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy, Zenghui Yu,
	wanghaibin.wang

On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:

> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. 

I'm still confused why drivers need fault handlers at all?

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 12:04       ` Parav Pandit
@ 2021-06-01 17:36         ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:36 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 12:04:00PM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, May 31, 2021 11:43 PM
> > 
> > On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
> > 
> > > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid?
> > > For example few years back such system call mpin() thought was proposed
> > in [1].
> > 
> > Reference counting of the overall pins are required
> > 
> > So when a pinned pages is incorporated into an IOASID page table in a later
> > IOCTL it means it cannot be unpinned while the IOASID page table is using it.
> Ok. but cant it use the same refcount of that mmu uses?

Manipulating that refcount is part of the overhead we are trying to
avoid here, plus it ensures that the pinned pages accounting
doesn't get out of sync with the actual count of pinned pages!

> > Then the ioasid's interval tree would be mapped into a page table tree in HW
> > format.

> Does it mean in simple use case [1], second level page table copy is
> maintained in the IOMMU side via map interface?  I hope not. It
> should use the same as what mmu uses, right?

Not a full page by page copy, but some interval reference.
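
To sketch what such an interval reference could look like (purely
conceptual, not a proposed kernel structure):

    /* Purely conceptual sketch -- not part of the proposal. */
    struct pinned_interval {
        unsigned long iova_start;   /* user-pinned IOVA range */
        unsigned long iova_last;
        struct page **pages;        /* pinned and accounted exactly once */
        struct rb_node node;        /* lives in the ioasid's interval tree */
    };

    /*
     * When a later map request covers [iova, iova+len), the ioasid layer
     * looks up the covering pinned_interval(s) and programs the HW-format
     * I/O page table from interval->pages, instead of re-pinning and
     * re-accounting every page a second time.
     */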

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  8:10     ` Tian, Kevin
@ 2021-06-01 17:42       ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 29, 2021 1:36 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > 
> > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > software nesting. With hardware support the child and parent I/O page
> > > tables are walked consecutively by the IOMMU to form a nested translation.
> > > When it's implemented in software, the ioasid driver is responsible for
> > > merging the two-level mappings into a single-level shadow I/O page table.
> > > Software nesting requires both child/parent page tables operated through
> > > the dma mapping protocol, so any change in either level can be captured
> > > by the kernel to update the corresponding shadow mapping.
> > 
> > Why? A SW emulation could do this synchronization during invalidation
> > processing if invalidation contained an IOVA range.
> 
> In this proposal we differentiate between host-managed and user-
> managed I/O page tables. If host-managed, the user is expected to use
> map/unmap cmd explicitly upon any change required on the page table. 
> If user-managed, the user first binds its page table to the IOMMU and 
> then use invalidation cmd to flush iotlb when necessary (e.g. typically
> not required when changing a PTE from non-present to present).
> 
> We expect user to use map+unmap and bind+invalidate respectively
> instead of mixing them together. Following this policy, map+unmap
> must be used in both levels for software nesting, so changes in either 
> level are captured timely to synchronize the shadow mapping.

map+unmap or bind+invalidate is a policy of the IOASID itself set when
it is created. If you put two different types in a tree then each IOASID
must continue to use its own operation mode.

I don't see a reason to force all IOASIDs in a tree to be consistent??

A software emulated two level page table where the leaf level is a
bound page table in guest memory should continue to use
bind/invalidate to maintain the guest page table IOASID even though it
is a SW construct.

The GPA level should use map/unmap because it is a kernel-owned page
table.

Though how to efficiently mix map/unmap on the GPA when there are SW
nested levels below it looks to be quite challenging.
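
As an illustration of mixing the two operation modes in one tree, a
hypothetical sequence could look like this (ioctl and struct names follow
the style of the proposal but are illustrative only, not final
definitions):

    /* Illustrative only -- not a defined uAPI. */
    int ioasid_fd = open("/dev/ioasid", O_RDWR);

    /* GPA level: kernel-owned I/O page table, managed with map/unmap. */
    int gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, NULL);
    struct ioasid_dma_map map = {
        .ioasid = gpa_ioasid, .iova = gpa, .vaddr = hva, .size = len,
    };
    ioctl(ioasid_fd, IOASID_MAP_DMA, &map);

    /* Guest page table level: a child bound to guest memory and maintained
     * with invalidation, even when the nesting itself is emulated in SW. */
    struct ioasid_nesting nest = { .parent = gpa_ioasid };
    int gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, &nest);
    struct ioasid_bind bind = {
        .ioasid = gva_ioasid, .pgtable_gpa = guest_pgd,
    };
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind);
    /* ...after the guest changes its page table... */
    struct ioasid_inv inv = { .ioasid = gva_ioasid /* plus a range */ };
    ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv);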

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  8:38     ` Tian, Kevin
@ 2021-06-01 17:56       ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 29, 2021 3:59 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > >
> > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > >
> > > 	ioasid_fd = open("/dev/ioasid", mode);
> > >
> > > For simplicity below examples are all made for the virtualization story.
> > > They are representative and could be easily adapted to a non-virtualization
> > > scenario.
> > 
> > For others, I don't think this is *strictly* necessary, we can
> > probably still get to the device_fd using the group_fd and fit in
> > /dev/ioasid. It does make the rest of this more readable though.
> 
> Jason, want to confirm here. Per earlier discussion we remain an
> impression that you want VFIO to be a pure device driver thus
> container/group are used only for legacy application.

Let me call this a "nice wish".

If you get to a point where you hard need this, then identify the hard
requirement and let's do it, but I wouldn't bloat this already large
project unnecessarily.

Similarly I wouldn't depend on the group fd existing in this design
so it could be changed later.

> From this comment are you suggesting that VFIO can still keep
> container/ group concepts and user just deprecates the use of vfio
> iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> simple policy that an IOASID will reject cmd if partially-attached
> group exists)?

I would say no on the container. /dev/ioasid == the container; having
two competing objects at once in a single process is just a mess.

Whether the group fd can be kept requires charting a path through the
ioctls where the container is not used and /dev/ioasid is sub'd in
using the same device-FD-specific IOCTLs you show here.

I didn't try to chart this out carefully.

Also, ultimately, something needs to be done about compatibility with
the vfio container fd. It looks clear enough to me that the VFIO
container FD is just a single IOASID using a special ioctl interface,
so it would be quite reasonable to harmonize these somehow.

But that is too complicated and far out for me at least to guess on at
this point..

> > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > there any scenario where we want different vpasid's for the same
> > IOASID? I guess it is OK like this. Hum.
> 
> Yes, it's completely sane that the guest links a I/O page table to 
> different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> that when multiple devices share an I/O page table they must use
> the same PASID#. 

Ok..

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  7:01     ` Tian, Kevin
@ 2021-06-01 20:28       ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 20:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 29, 2021 4:03 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > /dev/ioasid provides an unified interface for managing I/O page tables for
> > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > vDPA,
> > > etc.) are expected to use this interface instead of creating their own logic to
> > > isolate untrusted device DMAs initiated by userspace.
> > 
> > It is very long, but I think this has turned out quite well. It
> > certainly matches the basic sketch I had in my head when we were
> > talking about how to create vDPA devices a few years ago.
> > 
> > When you get down to the operations they all seem pretty common sense
> > and straightfoward. Create an IOASID. Connect to a device. Fill the
> > IOASID with pages somehow. Worry about PASID labeling.
> > 
> > It really is critical to get all the vendor IOMMU people to go over it
> > and see how their HW features map into this.
> > 
> 
> Agree. btw I feel it might be good to have several design opens 
> centrally discussed after going through all the comments. Otherwise 
> they may be buried in different sub-threads and potentially with 
> insufficient care (especially for people who haven't completed the
> reading).
> 
> I summarized five opens here, about:
> 
> 1)  Finalizing the name to replace /dev/ioasid;
> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> 3)  Carry device information in invalidation/fault reporting uAPI;
> 4)  What should/could be specified when allocating an IOASID;
> 5)  The protocol between vfio group and kvm;
> 
> For 1), two alternative names are mentioned: /dev/iommu and 
> /dev/ioas. I don't have a strong preference and would like to hear 
> votes from all stakeholders. /dev/iommu is slightly better imho for 
> two reasons. First, per AMD's presentation in last KVM forum they 
> implement vIOMMU in hardware thus need to support user-managed 
> domains. An iommu uAPI notation might make more sense moving 
> forward. Second, it makes later uAPI naming easier as 'IOASID' can 
> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of 
> IOASID_ALLOC_IOASID. :)

I think two years ago I suggested /dev/iommu and it didn't go very far
at the time. We've also talked about this as /dev/sva for a while and
now /dev/ioasid

I think /dev/iommu is fine, and call the things inside them IOAS
objects.

Then we don't have naming aliasing with kernel constructs.
 
> For 2), Jason prefers to not blocking it if no kernel design reason. If 
> one device is allowed to bind multiple IOASID fd's, the main problem
> is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1 
> and giova_ioasid created in fd2 and then nesting them together (and

Huh? This can't happen

Creating an IOASID is an operation on the /dev/ioasid FD. We won't
provide APIs to create a tree of IOASIDs outside a single FD container.

If a device can consume multiple IOASID's it doesn't care how many or
what /dev/ioasid FDs they come from.

> To the other end there was also thought whether we should make
> a single I/O address space per IOASID fd. This was discussed in previous
> thread that #fd's are insufficient to afford theoretical 1M's address
> spaces per device. But let's have another revisit and draw a clear
> conclusion whether this option is viable.

I had remarks on this, I think per-fd doesn't work
 
> This implies that VFIO_BOUND_IOASID will be extended to allow user
> specify a device label. This label will be recorded in /dev/iommu to
> serve per-device invalidation request from and report per-device 
> fault data to the user.

I wonder whether the user providing a 64-bit cookie or the kernel
returning a small IDA is the better choice here? Both have merits
depending on what qemu needs..

> In addition, vPASID (if provided by user) will
> be also recorded in /dev/iommu so vPASID<->pPASID conversion 
> is conducted properly. e.g. invalidation request from user carries
> a vPASID which must be converted into pPASID before calling iommu
> driver. Vice versa for raw fault data which carries pPASID while the
> user expects a vPASID.

I don't think the PASID should be returned at all. It should return
the IOASID number in the FD and/or a u64 cookie associated with that
IOASID. Userspace should figure out what the IOASID & device
combination means.
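
In other words, a user-visible fault record might carry only FD-level
identifiers, along the lines of (hypothetical layout, not part of the
proposal):

    /* Hypothetical layout -- no pPASID is exposed to userspace. */
    struct ioasid_fault_event {
        __u32 ioasid;       /* FD-local IOASID number */
        __u32 flags;
        __u64 dev_cookie;   /* 64-bit label userspace supplied at bind time */
        __u64 addr;         /* faulting I/O virtual address */
    };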

> Seems to close this design open we have to touch the kAPI design. and 
> Joerg's input is highly appreciated here.

uAPI is forever, the kAPI is constantly changing. I always dislike
warping the uAPI based on the current kAPI situation.

Jason

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  7:01     ` Tian, Kevin
@ 2021-06-01 22:22       ` Alex Williamson
  -1 siblings, 0 replies; 518+ messages in thread
From: Alex Williamson @ 2021-06-01 22:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok,
	Liu, Yi L, Wu, Hao, Jiang, Dave, Jacob Pan,
	Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy

On Tue, 1 Jun 2021 07:01:57 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> I summarized five opens here, about:
> 
> 1)  Finalizing the name to replace /dev/ioasid;
> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> 3)  Carry device information in invalidation/fault reporting uAPI;
> 4)  What should/could be specified when allocating an IOASID;
> 5)  The protocol between vfio group and kvm;
> 
...
> 
> For 5), I'd expect Alex to chime in. Per my understanding looks the
> original purpose of this protocol is not about I/O address space. It's
> for KVM to know whether any device is assigned to this VM and then
> do something special (e.g. posted interrupt, EPT cache attribute, etc.).

Right, the original use case was for KVM to determine whether it needs
to emulate wbinvd, so it needs to be aware when an assigned device is
present and be able to test if DMA for that device is cache coherent.
The user, QEMU, creates a KVM "pseudo" device representing the vfio
group, providing the file descriptor of that group to show ownership.
The ugly symbol_get code is to avoid hard module dependencies, ie. the
kvm module should not pull in or require the vfio module, but vfio will
be present if attempting to register this device.
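
For reference, the registration is roughly the following today, via the
kvm-vfio pseudo device (simplified, error handling trimmed):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* vm_fd: KVM VM fd; group_fd: an open /dev/vfio/<group> fd */
    static int kvm_register_vfio_group(int vm_fd, int group_fd)
    {
        struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
        struct kvm_device_attr attr;

        if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd))
            return -1;

        attr = (struct kvm_device_attr) {
            .group = KVM_DEV_VFIO_GROUP,
            .attr  = KVM_DEV_VFIO_GROUP_ADD,
            .addr  = (__u64)(unsigned long)&group_fd,  /* ptr to the int fd */
        };
        return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
    }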

With kvmgt, the interface also became a way to register the kvm pointer
with vfio for the translation mentioned elsewhere in this thread.

The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
page table so that it can handle iotlb programming from pre-registered
memory without trapping out to userspace.

> Because KVM deduces some policy based on the fact of assigned device, 
> it needs to hold a reference to related vfio group. this part is irrelevant
> to this RFC. 

All of these use cases are related to the IOMMU, whether DMA is
coherent, translating device IOVA to GPA, and an acceleration path to
emulate IOMMU programming in kernel... they seem pretty relevant.

> But ARM's VMID usage is related to I/O address space thus needs some
> consideration. Another strange thing is about PPC. Looks it also leverages
> this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> group. I don't know why it's done through KVM instead of VFIO uAPI in
> the first place.

AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
to know how to handle those for in-kernel acceleration.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 20:28       ` Jason Gunthorpe
@ 2021-06-02  1:25         ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-02  1:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 2, 2021 4:29 AM
> 
> On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 4:03 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > /dev/ioasid provides an unified interface for managing I/O page tables
> for
> > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > vDPA,
> > > > etc.) are expected to use this interface instead of creating their own
> logic to
> > > > isolate untrusted device DMAs initiated by userspace.
> > >
> > > It is very long, but I think this has turned out quite well. It
> > > certainly matches the basic sketch I had in my head when we were
> > > talking about how to create vDPA devices a few years ago.
> > >
> > > When you get down to the operations they all seem pretty common
> sense
> > > and straightfoward. Create an IOASID. Connect to a device. Fill the
> > > IOASID with pages somehow. Worry about PASID labeling.
> > >
> > > It really is critical to get all the vendor IOMMU people to go over it
> > > and see how their HW features map into this.
> > >
> >
> > Agree. btw I feel it might be good to have several design opens
> > centrally discussed after going through all the comments. Otherwise
> > they may be buried in different sub-threads and potentially with
> > insufficient care (especially for people who haven't completed the
> > reading).
> >
> > I summarized five opens here, about:
> >
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> >
> > For 1), two alternative names are mentioned: /dev/iommu and
> > /dev/ioas. I don't have a strong preference and would like to hear
> > votes from all stakeholders. /dev/iommu is slightly better imho for
> > two reasons. First, per AMD's presentation in last KVM forum they
> > implement vIOMMU in hardware thus need to support user-managed
> > domains. An iommu uAPI notation might make more sense moving
> > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > IOASID_ALLOC_IOASID. :)
> 
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time. We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
> 
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
> 
> Then we don't have naming aliasing with kernel constructs.
> 
> > For 2), Jason prefers to not blocking it if no kernel design reason. If
> > one device is allowed to bind multiple IOASID fd's, the main problem
> > is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
> > and giova_ioasid created in fd2 and then nesting them together (and
> 
> Huh? This can't happen
> 
> Creating an IOASID is an operation on on the /dev/ioasid FD. We won't
> provide APIs to create a tree of IOASID's outside a single FD container.
> 
> If a device can consume multiple IOASID's it doesn't care how many or
> what /dev/ioasid FDs they come from.

OK, this implies that if a user inadvertently creates an intended parent/
child pair via different fd's, the operation will simply fail. More specifically,
take ARM's case as an example. There is only a single 2nd-level I/O page
table per device (nested by multiple 1st-level tables). Say the user has already
created a gpa_ioasid for a device via fd1. Now he binds the device to fd2,
intending to enable vSVA, which requires nested translation and thus needs
a parent created via fd2. This parent creation will simply be rejected by the
IOMMU layer because the 2nd-level (via fd1) is already installed for this device.
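
Spelled out, the sequence would be roughly (hypothetical names, same
caveat as elsewhere in this thread):

    /* Illustrative sketch of the cross-fd conflict described above. */
    int fd1 = open("/dev/ioasid", O_RDWR);
    int fd2 = open("/dev/ioasid", O_RDWR);

    /* 2nd-level (GPA) I/O page table created and attached through fd1;
     * the device's only 2nd-level slot is now occupied. */
    int gpa_ioasid = ioctl(fd1, IOASID_ALLOC, NULL);
    /* ...device bound to fd1 and attached to gpa_ioasid... */

    /* The user later binds the same device to fd2 and tries to build the
     * nested (vSVA) setup there.  Whether the failure surfaces when the
     * parent is created or when the device is attached to it is an
     * implementation detail; either way the conflict is detected and the
     * request is rejected, e.g. with -EBUSY. */
    int parent = ioctl(fd2, IOASID_ALLOC, NULL);
    /* attach of the device to 'parent' via fd2 fails */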

> 
> > To the other end there was also thought whether we should make
> > a single I/O address space per IOASID fd. This was discussed in previous
> > thread that #fd's are insufficient to afford theoretical 1M's address
> > spaces per device. But let's have another revisit and draw a clear
> > conclusion whether this option is viable.
> 
> I had remarks on this, I think per-fd doesn't work
> 
> > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > specify a device label. This label will be recorded in /dev/iommu to
> > serve per-device invalidation request from and report per-device
> > fault data to the user.
> 
> I wonder which of the user providing a 64 bit cookie or the kernel
> returning a small IDA is the best choice here? Both have merits
> depending on what qemu needs..

Yes, either way can work. I don't have a strong preference. Jean?

> 
> > In addition, vPASID (if provided by user) will
> > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > is conducted properly. e.g. invalidation request from user carries
> > a vPASID which must be converted into pPASID before calling iommu
> > driver. Vice versa for raw fault data which carries pPASID while the
> > user expects a vPASID.
> 
> I don't think the PASID should be returned at all. It should return
> the IOASID number in the FD and/or a u64 cookie associated with that
> IOASID. Userspace should figure out what the IOASID & device
> combination means.

This is true for Intel. But what about ARM which has only one IOASID
(pasid table) per device to represent all guest I/O page tables?

> 
> > Seems to close this design open we have to touch the kAPI design. and
> > Joerg's input is highly appreciated here.
> 
> uAPI is forever, the kAPI is constantly changing. I always dislike
> warping the uAPI based on the current kAPI situation.
> 

I get this point. My point was that I didn't see a significant gain from either
option, so to better compare the two uAPI options we might want to
also consider the involved kAPI effort as another factor.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:42       ` Jason Gunthorpe
@ 2021-06-02  1:33         ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-02  1:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 2, 2021 1:42 AM
> 
> On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 1:36 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > software nesting. With hardware support the child and parent I/O page
> > > > tables are walked consecutively by the IOMMU to form a nested
> translation.
> > > > When it's implemented in software, the ioasid driver is responsible for
> > > > merging the two-level mappings into a single-level shadow I/O page
> table.
> > > > Software nesting requires both child/parent page tables operated
> through
> > > > the dma mapping protocol, so any change in either level can be
> captured
> > > > by the kernel to update the corresponding shadow mapping.
> > >
> > > Why? A SW emulation could do this synchronization during invalidation
> > > processing if invalidation contained an IOVA range.
> >
> > In this proposal we differentiate between host-managed and user-
> > managed I/O page tables. If host-managed, the user is expected to use
> > map/unmap cmd explicitly upon any change required on the page table.
> > If user-managed, the user first binds its page table to the IOMMU and
> > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > not required when changing a PTE from non-present to present).
> >
> > We expect user to use map+unmap and bind+invalidate respectively
> > instead of mixing them together. Following this policy, map+unmap
> > must be used in both levels for software nesting, so changes in either
> > level are captured timely to synchronize the shadow mapping.
> 
> map+unmap or bind+invalidate is a policy of the IOASID itself set when
> it is created. If you put two different types in a tree then each IOASID
> must continue to use its own operation mode.
> 
> I don't see a reason to force all IOASIDs in a tree to be consistent??

Only for software nesting. With hardware support the parent uses map
while the child uses bind.

Yes, the policy is specified per IOASID. But if the policy violates the
requirement of a specific nesting mode, then nesting should fail.

> 
> A software emulated two level page table where the leaf level is a
> bound page table in guest memory should continue to use
> bind/invalidate to maintain the guest page table IOASID even though it
> is a SW construct.

With software nesting the leaf should be a host-managed page table
(or metadata). A bind/invalidate protocol doesn't require the user
to notify the kernel of every page table change, but for software nesting
the kernel must know about every change so it can update the shadow/merged
mapping in time; otherwise DMA may hit a stale mapping.
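
As a rough sketch of this policy (using the proposal's IOASID_CREATE_NESTING
and IOASID_MAP_DMA; the map argument variables are only illustrative), a
software-nested tree keeps both levels on the map/unmap protocol:

	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, &gpa_ioasid);

	/* parent: GPA -> HPA, kernel-managed via map/unmap */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &gpa_map);	/* ioasid = gpa_ioasid */

	/* child: GIOVA -> GPA, also map/unmap; every change goes through
	 * the kernel so the shadow GIOVA -> HPA table can be re-merged.
	 * A bind+invalidate child would only tell the kernel at iotlb
	 * invalidation time, which is too late for shadow maintenance. */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &giova_map);	/* ioasid = giova_ioasid */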

> 
> The GPA level should use map/unmap because it is a kernel owned page
> table

yes, this is always true.

> 
> Though how to efficiently mix map/unmap on the GPA when there are SW
> nested levels below it looks to be quite challenging.
> 

Thanks
Kevin

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:56       ` Jason Gunthorpe
@ 2021-06-02  2:00         ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-02  2:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 2, 2021 1:57 AM
> 
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use
> cases:
> > > >
> > > > 	ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-
> virtualization
> > > > scenario.
> > >
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> >
> > Jason, want to confirm here. Per earlier discussion we remain an
> > impression that you want VFIO to be a pure device driver thus
> > container/group are used only for legacy application.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 

OK, got your point. So let's start by keeping this room. New
sub-systems like vDPA don't need to invent a group fd uAPI and can
just leave it to their users to meet the group limitation. An existing
sub-system, i.e. VFIO, could keep a stronger group enforcement
uAPI like today. One day we may revisit it, if the simple policy works
well for all the new sub-systems.

> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

Yes, this is guaranteed. /dev/ioasid uAPI has no group concept.

> 
> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.
> 
> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

yes

> 
> I didn't try to chart this out carefully.
> 
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite reasonable to harmonize these somehow.

Possibly multiple IOASIDs, as a VFIO container can hold incompatible
devices today. Suppose helper functions will be provided for the VFIO
container to create IOASIDs and then use map/unmap to manage its I/O page
tables. This is the shim iommu driver concept from the previous discussion
between you and Alex.

This can be done at a later stage. Let's focus on the /dev/ioasid uAPI, and
bear some code duplication between it and vfio type1 for now. 
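
A loose sketch of that shim, with the in-kernel helper names (ioasid_map()
and the iommu->ioasid field) purely made up for illustration; the point is
only that the legacy type1 ioctls keep their uAPI while forwarding to the
common ioasid code:

	static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd,
					   unsigned long arg)
	{
		struct vfio_iommu *iommu = iommu_data;

		switch (cmd) {
		case VFIO_IOMMU_MAP_DMA: {
			struct vfio_iommu_type1_dma_map map;

			if (copy_from_user(&map, (void __user *)arg, sizeof(map)))
				return -EFAULT;
			/* iommu->ioasid is hypothetical; one container may
			 * shim onto several IOASIDs when it holds devices
			 * with incompatible IOMMU capabilities */
			return ioasid_map(iommu->ioasid, map.iova, map.vaddr,
					  map.size, map.flags);
		}
		/* VFIO_IOMMU_UNMAP_DMA, VFIO_IOMMU_GET_INFO, ... likewise */
		}
		return -ENOTTY;
	}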

> 
> But that is too complicated and far out for me at least to guess on at
> this point..

We're working on a prototype in parallel with this discussion. Based on
this work we'll figure out what's the best way to start with.

Thanks
Kevin

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 22:22       ` Alex Williamson
@ 2021-06-02  2:20         ` Tian, Kevin
  -1 siblings, 0 replies; 518+ messages in thread
From: Tian, Kevin @ 2021-06-02  2:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok,
	Liu, Yi L, Wu, Hao, Jiang, Dave, Jacob Pan,
	Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, June 2, 2021 6:22 AM
> 
> On Tue, 1 Jun 2021 07:01:57 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > I summarized five opens here, about:
> >
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> >
> ...
> >
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> 
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is

invlpg -> wbinvd :)

> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.

So the symbol_get thing is not about the protocol itself. Whatever protocol
is defined, as long as kvm needs to call vfio or ioasid helper functions, we 
need to define a proper way to do it. Jason, what's your opinion on an 
alternative option, since you dislike symbol_get?
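
For reference, the existing kvm-vfio coupling looks roughly like the sketch
below (from memory, not verbatim): symbol_get() only resolves the symbol if
the vfio module is already loaded, so kvm never gains a hard link-time
dependency. Any /dev/ioasid helper that kvm must call would need either the
same trick or a small always-built-in ioasid core:

	struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
	{
		struct vfio_group *(*fn)(struct file *);
		struct vfio_group *group;

		fn = symbol_get(vfio_group_get_external_user);
		if (!fn)
			return ERR_PTR(-EINVAL);	/* vfio not loaded */

		group = fn(filep);

		symbol_put(vfio_group_get_external_user);
		return group;
	}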

> 
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
> 
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.
> 
> > Because KVM deduces some policy based on the fact of assigned device,
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC.
> 
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.

One open is whether kvm should hold a device reference or IOASID
reference. For DMA coherence, it only matters whether assigned 
devices are coherent or not (not for a specific address space). For kvmgt, 
it is for recoding kvm pointer in mdev driver to do write protection. For 
ppc, it does relate to a specific I/O page table.

Then I feel only a part of the protocol will be moved to /dev/ioasid and
something will still remain between kvm and vfio?

> 
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. Looks it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
> 
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM
> needs
> to know how to handle those for in-kernel acceleration.  Thanks,
> 

ok.

Thanks
Kevin

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:26       ` Jason Gunthorpe
@ 2021-06-02  4:01         ` Lu Baolu
  -1 siblings, 0 replies; 518+ messages in thread
From: Lu Baolu @ 2021-06-02  4:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Tian, Kevin, LKML, Joerg Roedel, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> 
>> This version only covers 1) and 4). Do you think we need to support 2),
>> 3) and beyond?
> 
> Yes absolutely. The API should be flexible enough to specify the
> creation of all future page table formats we'd want to have and all HW
> specific details on those formats.

OK, let's stay on the same line.

>> If so, it seems that we need some in-kernel helpers and uAPIs to
>> support pre-installing a page table to IOASID.
> 
> Not sure what this means..

Sorry that I didn't make this clear.

Let me bring back the page table types as I see them.

  1) IOMMU format page table (a.k.a. iommu_domain)
  2) user application CPU page table (SVA for example)
  3) KVM EPT (future option)
  4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with its IOASID.
We have the BIND protocol for 4); we explicitly allocate an iommu_domain for
1). But we don't have a clear definition for 2), 3) and others. I think
it's necessary to clearly define a time point and a kAPI name between
IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
opportunity to associate their page table with the allocated IOASID
before the page table is attached to the real IOMMU hardware.

I/O page fault handling is similar. The provider of the page table
should take the responsibility for handling the possible page faults.

Could this answer the question of "I'm still confused why drivers need
fault handlers at all?" in the thread below?

https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#m15def9e8b236dfcf97e21c8e9f8a58da214e3691
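
To illustrate the kind of kAPI hook meant above (the helper and type names
here are entirely made up, nothing like this exists today), a page-table
provider could register itself against an allocated-but-not-yet-attached
IOASID, including its own fault handler:

	enum ioasid_pgtable_type {
		IOASID_PGTABLE_IOMMU,		/* 1) iommu_domain            */
		IOASID_PGTABLE_CPU_SVA,		/* 2) user CPU page table     */
		IOASID_PGTABLE_KVM_EPT,		/* 3) KVM EPT (future)        */
		IOASID_PGTABLE_GUEST,		/* 4) guest-managed (nesting) */
	};

	struct ioasid_pgtable_ops {
		int (*attach)(u32 ioasid, struct device *dev, void *data);
		/* the provider owns I/O page fault handling for its table */
		int (*iopf_handler)(u32 ioasid, struct iommu_fault *fault,
				    void *data);
	};

	/* called after userspace IOASID_ALLOC, before any device attach */
	int ioasid_register_pgtable(struct ioasid_ctx *ctx, u32 ioasid,
				    enum ioasid_pgtable_type type,
				    const struct ioasid_pgtable_ops *ops,
				    void *data);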

> 
>>  From this point of view an IOASID is actually not just a variant of
>> iommu_domain, but an I/O page table representation in a broader
>> sense.
> 
> Yes, and things need to evolve in a staged way. The ioctl API should
> have room for this growth but you need to start out with some
> constrained enough to actually implement then figure out how to grow
> from there

Yes, agreed. I just think about it from the perspective of a design
document.

Best regards,
baolu

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:33           ` Jason Gunthorpe
@ 2021-06-02  4:50             ` Shenming Lu
  -1 siblings, 0 replies; 518+ messages in thread
From: Shenming Lu @ 2021-06-02  4:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Lu Baolu
  Cc: Tian, Kevin, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy, Zenghui Yu,
	wanghaibin.wang

On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> 
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults. 
> 
> I'm still confused why drivers need fault handlers at all?

Essentially it is the userspace that needs the fault handlers:
one case is to deliver the faults to the vIOMMU, and another
case is to enable IOPF on the GPA address space for on-demand
paging. It seems that both could be specified in/through the
IOASID_ALLOC ioctl?
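
A minimal sketch of that thought, assuming IOASID_ALLOC grows an argument
struct (it takes none in the current proposal) and with hypothetical flag
names:

	struct ioasid_alloc_arg {
		__u32	flags;
	#define IOASID_IOPF_TO_USER	(1 << 0) /* deliver faults to userspace,
						    e.g. to forward to a vIOMMU */
	#define IOASID_IOPF_KERNEL	(1 << 1) /* kernel handles faults on the
						    GPA space, on-demand paging */
		__u32	eventfd;	/* notification when IOPF_TO_USER is set */
	};

	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC,
			   &(struct ioasid_alloc_arg){ .flags = IOASID_IOPF_KERNEL });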

Thanks,
Shenming


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-06-02  6:15   ` David Gibson
  -1 siblings, 0 replies; 518+ messages in thread
From: David Gibson @ 2021-06-02  6:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the writeup.  I'm giving this a first pass review, note
that I haven't read all the existing replies in detail yet.

> 
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
>     2.1. /dev/ioasid uAPI
>     2.2. /dev/vfio uAPI
>     2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
>     5.1. A simple example
>     5.2. Multiple IOASIDs (no nesting)
>     5.3. IOASID nesting (software)
>     5.4. IOASID nesting (hardware)
>     5.5. Guest SVA (vSVA)
>     5.6. I/O page fault
>     5.7. BIND_PASID_TABLE
> ====
> 
> 1. Terminologies and Concepts
> -----------------------------------------
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).

Is there a compelling reason to have all the IOASIDs handled by one
FD?  Simply on the grounds that handles to kernel internal objects are
usually fds, having an fd per ioasid seems like an obvious alternative.
In that case plain open() would replace IOASID_ALLOC.  Nesting could be
handled either by 1) having a CREATE_NESTED on the parent fd which
spawns a new fd or 2) opening /dev/ioasid again for a new fd and doing
a SET_PARENT before doing anything else.

I may be bikeshedding here..
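
For comparison, the two fd-per-IOASID variants would look something like the
sketch below (CREATE_NESTED and SET_PARENT are hypothetical names):

	/* variant 1: the parent fd spawns the child fd */
	parent_fd = open("/dev/ioasid", O_RDWR);	/* replaces IOASID_ALLOC */
	child_fd = ioctl(parent_fd, IOASID_CREATE_NESTED); /* returns a new fd */

	/* variant 2: open twice, then link before anything else */
	parent_fd = open("/dev/ioasid", O_RDWR);
	child_fd = open("/dev/ioasid", O_RDWR);
	ioctl(child_fd, IOASID_SET_PARENT, &parent_fd);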

> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind an user-managed I/O page table to the 
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASID's, implying 
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.

To clarify, I'm guessing that's a restriction of likely practice,
rather than a fundamental API restriction.  I can see a couple of
theoretical future cases where a user-managed pagetable for a "base"
IOASID would be feasible:

  1) On some fancy future MMU allowing free nesting, where the kernel
     would insert an implicit extra layer translating user addresses
     to physical addresses, and the userspace manages a pagetable with
     its own VAs being the target AS
  2) For a purely software virtual device, where its virtual DMA
     engine can interpret user addresses fine

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

As Jason also said, I don't think you need to restrict software
nesting to only kernel managed L2 tables - you already need hooks for
cache invalidation, and you can use those to trigger reshadows.

> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/ioasid context always refers to a 
> physical one or 'pdev' (PF or VF). 

What you mean by "physical" device here isn't really clear - VFs
aren't really physical devices, and the PF/VF terminology also doesn't
extend to non-PCI devices (which I think we want to consider for the
API, even if we're not implementing it any time soon).

Now, it's clear that we can't program things into the IOMMU before
attaching a device - we might not even know which IOMMU to use.
However, I'm not sure if its wise to automatically make the AS "real"
as soon as we attach a device:

 * If we're going to attach a whole bunch of devices, could we (for at
   least some IOMMU models) end up doing a lot of work which then has
   to be re-done for each extra device we attach?
   
 * With kernel managed IO page tables could attaching a second device
   (at least on some IOMMU models) require some operation which would
   require discarding those tables?  e.g. if the second device somehow
   forces a different IO page size

For that reason I wonder if we want some sort of explicit enable or
activate call.  Device attaches would only be valid before, map or
attach pagetable calls would only be valid after.
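
Something like the following sequence, with IOASID_ENABLE as a purely
hypothetical name and the attach ioctl written as VFIO_ATTACH_IOASID;
attaches are only accepted before the enable, map/bind only after, so the
kernel can size and build the page table once it knows every attached
device:

	ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	ioctl(dev_fd1, VFIO_ATTACH_IOASID, &ioasid);	/* ok, not enabled yet */
	ioctl(dev_fd2, VFIO_ATTACH_IOASID, &ioasid);	/* ok */

	ioctl(ioasid_fd, IOASID_ENABLE, &ioasid);	/* AS becomes "real" here */

	ioctl(ioasid_fd, IOASID_MAP_DMA, &map_arg);	/* ok after enable */
	ioctl(dev_fd3, VFIO_ATTACH_IOASID, &ioasid);	/* would now be rejected */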

> One I/O address space could be attached to multiple devices. In this case, 
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> 
> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I'm not really clear on how this interacts with nested ioasids.  Would
you generally expect the RID+PASID IOASes to be children of the base
RID IOAS, or not?

If the PASID ASes are children of the RID AS, can we consider this not
as the device explicitly attaching to multiple IOASIDs, but instead
attaching to the parent IOASID with awareness of the child ones?

> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying 
> the routing information and registering it to the ioasid driver when calling 
> ioasid attach helper function. It could be RID if the assigned device is 
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
> user might also provide its view of virtual routing information (vPASID) in 
> the attach call, e.g. when multiple user-managed I/O address spaces are 
> attached to the vfio_device. In this case VFIO must figure out whether 
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
> 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.
> 
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 
> 
> Modern devices may support a scalable workload submission interface 
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having 
> PASID saved in the CPU MSR and carried in the instruction payload 
> when sent out to the device. Then a single work queue shared by 
> multiple processes can compose DMAs carrying different PASIDs. 

Is the assumption here that the processes share the IOASID FD
instance, but not memory?

> When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability 
> for auto-conversion in the fast path. The user is expected to setup the 
> PASID mapping through KVM uAPI, with information about {vpasid, 
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
> to figure out the actual pPASID given an IOASID.
> 
> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For 
> example, I/O page fault is always reported to userspace per IOASID, 
> although it's physically reported per device (RID+PASID). If there is a 
> need of further relaying this fault into the guest, the user is responsible 
> of identifying the device attached to this IOASID (randomly pick one if 
> multiple attached devices) and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space.

Do we need to consider two management modes here, much as we have for
the pagetables themselves: either kernel managed, in which we have
explicit calls to bind a vPASID to a parent PASID, or user managed in
which case we register a table in some format?

> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who 
> actually writes the PASID table). One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. However this way significantly 
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device. 
> 
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
> device notation in this interface as aforementioned. But the ioasid driver 
> does implicit check to make sure that devices within an iommu group 
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to 
> the user.

An explicit ENABLE call might make this checking simpler.

> There was a long debate in previous discussion whether VFIO should keep 
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
> a simplified model where every device bound to VFIO is explicitly listed 
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for 
> understanding the group topology and meeting the implicit group check 
> criteria enforced in /dev/ioasid. The use case examples in this proposal 
> are based on the new model.
> 
> Of course for backward compatibility VFIO still needs to keep the existing 
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
> iommu ops to internal ioasid helper functions.
> 
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.
> 
> -   PPC has not be considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.

From what I've seen so far, it seems ok to me.  Note that at this
stage I'm only familiar with existing PPC IOMMUs, which don't have
PASID or anything similar.  I'm not sure what IBM's future plans are
for IOMMUs, so there will be more checking to be done.

> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

I think that's only used on PPC, as an optimization for PAPR's
paravirt IOMMU with a small default IOVA window.  I think we can do
something equivalent for IOASIDs from what I've seen so far.

> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
>     which can be physically isolated in-between through PASID-granular
>     IOMMU protection. Historically people also discussed one usage by 
>     mediating a pdev into a mdev. This usage is not covered here, and is 
>     supposed to be replaced by Max's work which allows overriding various 
>     VFIO operations in vfio-pci driver.

I think there are a couple of different mdev cases, so we'll need to
be careful of that and clarify our terminology a bit.

> 2. uAPI Proposal
> ----------------------
> 
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
> 
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
> 
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
> 
> 
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> 
> /*
>   * Check whether an uAPI extension is supported. 
>   *
>   * This is for FD-level capabilities, such as locked page pre-registration. 
>   * IOASID-level capabilities are reported through IOASID_GET_INFO.
>   *
>   * Return: 0 if not supported, 1 if supported.
>   */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
> 
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)

AIUI PPC is the main user of the current pre-registration API, though
it could have value in any vIOMMU case to avoid possibly costly
accounting on every guest map/unmap.

I wonder if there's a way to model this using a nested AS rather than
requiring special operations.  e.g.

	'prereg' IOAS
	|
	\- 'rid' IOAS
	   |
	   \- 'pasid' IOAS (maybe)

'prereg' would have a kernel managed pagetable into which (for
example) qemu platform code would map all guest memory (using
IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
IO mappings into the 'rid' IOAS in terms of GPA.

This wouldn't quite work as is, because the 'prereg' IOAS would have
no devices.  But we could potentially have another call to mark an
IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
would be an alternative to attaching devices.
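
As a sketch of how that could look from qemu's side (the "pure virtual"
marking ioctl and the attach ioctl name are hypothetical, the rest uses the
proposal's calls):

	prereg_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	ioctl(ioasid_fd, IOASID_SET_PURE_SW, &prereg_ioasid); /* hypothetical:
				never attached to devices, prereg only */

	/* map all guest RAM once; pinning and locked-memory accounting
	 * happen here, not on every guest map/unmap */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &all_guest_ram_map);

	/* the 'rid' IOAS nests under prereg; the vIOMMU mirrors guest
	 * mappings here in terms of GPA, cheaply, since the backing pages
	 * are already accounted in the parent */
	rid_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, &prereg_ioasid);
	ioctl(dev_fd, VFIO_ATTACH_IOASID, &rid_ioasid);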

> /*
>   * Allocate an IOASID. 
>   *
>   * IOASID is the FD-local software handle representing an I/O address 
>   * space. Each IOASID is associated with a single I/O page table. User 
>   * must call this ioctl to get an IOASID for every I/O address space that is
>   * intended to be enabled in the IOMMU.
>   *
>   * A newly-created IOASID doesn't accept any command before it is 
>   * attached to a device. Once attached, an empty I/O page table is 
>   * bound with the IOMMU then the user could use either DMA mapping 
>   * or pgtable binding commands to manage this I/O page table.
>   *
>   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>   *
>   * Return: allocated ioasid on success, -errno on failure.
>   */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> 
> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);

Can I request we represent this in terms of permitted IOVA ranges,
rather than reserved IOVA ranges.  This works better with the "window"
model I have in mind for unifying the restrictions of the POWER IOMMU
with Type1 like mapping.

>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *
>   * Output parameters:
>   *	- many. TBD.
>   */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
> 
> 
> /*
>   * Map/unmap process virtual addresses to I/O virtual addresses.
>   *
>   * Provide VFIO type1 equivalent semantics. Start with the same 
>   * restriction e.g. the unmap size should match those used in the 
>   * original mapping call. 
>   *
>   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>   * must be already in the preregistered list.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *	- refer to vfio_iommu_type1_dma_{un}map
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)

I'm assuming these would be expected to fail if a user managed
pagetable has been bound?

> /*
>   * Create a nesting IOASID (child) on an existing IOASID (parent)
>   *
>   * IOASIDs can be nested together, implying that the output address 
>   * from one I/O page table (child) must be further translated by 
>   * another I/O page table (parent).
>   *
>   * As the child adds essentially another reference to the I/O page table 
>   * represented by the parent, any device attached to the child ioasid 
>   * must be already attached to the parent.
>   *
>   * In concept there is no limit on the number of the nesting levels. 
>   * However for the majority case one nesting level is sufficient. The
>   * user should check whether an IOASID supports nesting through 
>   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>   * the nesting capability is reported only on the parent instead of the
>   * child.
>   *
>   * User also needs check (via IOASID_GET_INFO) whether the nesting 
>   * is implemented in hardware or software. If software-based, DMA 
>   * mapping protocol should be used on the child IOASID. Otherwise, 
>   * the child should be operated with pgtable binding protocol.
>   *
>   * Input parameters:
>   *	- u32 parent_ioasid;
>   *
>   * Return: child_ioasid on success, -errno on failure;
>   */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
> 
> 
> /*
>   * Bind an user-managed I/O page table with the IOMMU
>   *
>   * Because user page table is untrusted, IOASID nesting must be enabled 
>   * for this ioasid so the kernel can enforce its DMA isolation policy 
>   * through the parent ioasid.
>   *
>   * Pgtable binding protocol is different from DMA mapping. The latter 
>   * has the I/O page table constructed by the kernel and updated 
>   * according to user MAP/UNMAP commands. With pgtable binding the 
>   * whole page table is created and updated by userspace, thus different 
>   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>   *
>   * Because the page table is directly walked by the IOMMU, the user 
>   * must  use a format compatible to the underlying hardware. It can 
>   * check the format information through IOASID_GET_INFO.
>   *
>   * The page table is bound to the IOMMU according to the routing 
>   * information of each attached device under the specified IOASID. The
>   * routing information (RID and optional PASID) is registered when a 
>   * device is attached to this IOASID through VFIO uAPI. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of the user page table;
>   *	- formats (vendor, address_width, etc.);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)

I'm assuming that UNBIND would return the IOASID to a kernel-managed
pagetable?

For debugging and certain hypervisor edge cases it might be useful to
have a call to allow userspace to look up a specific IOVA in a guest
managed pgtable.
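
For instance, a lookup ioctl along these lines (entirely hypothetical,
including the ioctl number) could report what a bound user-managed table
currently translates an IOVA to:

	/*
	  * Look up one IOVA in the I/O page table bound to an IOASID
	  * (debug / hypervisor edge cases only).
	  */
	struct ioasid_lookup_iova {
		__u32	ioasid;
		__u64	iova;		/* input address                 */
		__u64	out_addr;	/* output (next-level) address   */
		__u32	perm;		/* permissions found in the PTE  */
	};
	#define IOASID_LOOKUP_IOVA	_IO(IOASID_TYPE, IOASID_BASE + 14)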


> /*
>   * Bind an user-managed PASID table to the IOMMU
>   *
>   * This is required for platforms which place PASID table in the GPA space.
>   * In this case the specified IOASID represents the per-RID PASID space.
>   *
>   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>   * special flag to indicate the difference from normal I/O address spaces.
>   *
>   * The format info of the PASID table is reported in IOASID_GET_INFO.
>   *
>   * As explained in the design section, user-managed I/O page tables must
>   * be explicitly bound to the kernel even on these platforms. This allows
>   * the kernel to uniformly manage I/O address spaces across all platforms.
>   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>   * to carry device routing information to indirectly mark the hidden I/O
>   * address spaces.
>   *
>   * Input parameters:
>   *	- child_ioasid;

Wouldn't this be the parent ioasid, rather than one of the potentially
many child ioasids?

>   *	- address of PASID table;
>   *	- formats (vendor, size, etc.);
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
> 
> 
> /*
>   * Invalidate IOTLB for a user-managed I/O page table
>   *
>   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
>   * doesn't allow the user to specify cache type and likely support only
>   * two granularities (all, or a specified range) in the I/O address space.
>   *
>   * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
>   * cache). If the IOASID represents an I/O address space, the invalidation
>   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>   * represents a vPASID space, then this command applies to the PASID
>   * cache.
>   *
>   * Similarly this command doesn't provide IOMMU-like granularity
>   * info (domain-wide, pasid-wide, range-based), since it's all about the
>   * I/O address space itself. The ioasid driver walks the attached
>   * routing information to match the IOMMU semantics under the
>   * hood. 
>   *
>   * Input parameters:
>   *	- child_ioasid;

And couldn't this be any ioasid, not just a child one, depending on
whether you want PASID scope or RID scope invalidation?

>   *	- granularity
>   * 
>   * Return: 0 on success, -errno on failure
>   */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
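> 
> For illustration, in the pseudo-code style of section 5 (the flag names
> and range encoding below are placeholders; only "all" vs. a specified
> range is assumed, per the description above):
> 
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		.flags	= IOASID_INV_RANGE;	// or IOASID_INV_ALL (placeholder)
> 		.iova	= 0x2000;
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);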
> 
> 
> /*
>   * Page fault report and response
>   *
>   * This is TBD. It can be added after other parts are cleared up. Likely
>   * it will be a ring buffer shared between user/kernel, an eventfd to
>   * notify the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */
> 
> 
> /*
>   * Dirty page tracking 
>   *
>   * Track and report memory pages dirtied in I/O address spaces. There
>   * is ongoing work by Kunkun Jiang extending the existing VFIO type1
>   * code. It needs to be adapted to /dev/ioasid later.
>   */
> 
> 
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
> 
> /*
>   * Bind a vfio_device to the specified IOASID fd
>   *
>   * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
>   * vfio device should not be bound to multiple ioasid_fd's. 
>   *
>   * Input parameters:
>   *	- ioasid_fd;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> 
> /*
>   * Attach a vfio device to the specified IOASID
>   *
>   * Multiple vfio devices can be attached to the same IOASID, and vice 
>   * versa. 
>   *
>   * The user may optionally provide a "virtual PASID" to mark an I/O page
>   * table on this vfio device. Whether the virtual PASID is physically used
>   * or converted to another kernel-allocated PASID is a policy of the vfio
>   * device driver.
>   *
>   * There is no need to specify ioasid_fd in this call due to the assumed
>   * 1:1 connection between a vfio device and the bound fd.
>   *
>   * Input parameter:
>   *	- ioasid;
>   *	- flag;
>   *	- user_pasid (if specified);

Wouldn't the PASID be communicated by whether you give a parent or
child ioasid, rather than needing an extra value?

>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
> 
> 
> 2.3. KVM uAPI
> ++++++++++++
> 
> /*
>   * Update CPU PASID mapping
>   *
>   * This is necessary when ENQCMD will be used in the guest while the
>   * targeted device doesn't accept the vPASID saved in the CPU MSR.
>   *
>   * This command allows the user to set/clear the vPASID->pPASID mapping
>   * in the CPU, by providing the IOASID (and FD) information representing
>   * the I/O address space marked by this vPASID.
>   *
>   * Input parameters:
>   *	- user_pasid;
>   *	- ioasid_fd;
>   *	- ioasid;
>   */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
> 
> 
> 3. Sample structures and helper functions
> --------------------------------------------------------
> 
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> 
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
> 
> An ioasid_ctx is created for each fd:
> 
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;
> 		// a list of registered devices
> 		struct list_head		dev_list;
> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;
> 	};
> 
> Each registered device is represented by ioasid_dev:
> 
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device

Again "physical" isn't really clearly defined here.

> 		struct device 		*device;
> 		struct kref		kref;
> 	};
> 
> Because we assume one vfio_device is connected to at most one ioasid_fd,
> ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should point to the parent device. The PASID marking this
> mdev is specified later at VFIO_ATTACH_IOASID time.
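> 
> A rough sketch of how VFIO might use these helpers when handling
> VFIO_BIND_IOASID_FD (error handling omitted; the embedded ioasid_dev
> field name and the physical_device() helper are illustrative only):
> 
> 	/* in the vfio_device's VFIO_BIND_IOASID_FD handler */
> 	ctx = ioasid_ctx_fdget(ioasid_fd);
> 	vdev->ioasid_dev.ctx = ctx;
> 	vdev->ioasid_dev.device = physical_device(vdev);	// parent dev for mdev
> 	ioasid_register_device(ctx, &vdev->ioasid_dev);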
> 
> An ioasid_data is created at IOASID_ALLOC time, as the main object
> describing the characteristics of an I/O page table:
> 
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
> 
> 		// the IOASID number
> 		u32			ioasid;
> 
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;
> 
> 		// map metadata (vfio type1 semantics)
> 		struct rb_node		dma_list;

Why do you need this?  Can't you just store the kernel managed
mappings in the host IO pgtable?

> 		// pointer to user-managed pgtable (for nesting case)
> 		u64			user_pgd;

> 		// link to the parent ioasid (for nesting)
> 		struct ioasid_data	*parent;
> 
> 		// cache the global PASID shared by ENQCMD-capable
> 		// devices (see below explanation in section 4)
> 		u32			pasid;
> 
> 		// a list of device attach data (routing information)
> 		struct list_head		attach_data;
> 
> 		// a list of partially-attached devices (group)
> 		struct list_head		partial_devices;
> 
> 		// a list of fault_data reported from the iommu layer
> 		struct list_head		fault_data;
> 
> 		...
> 	}
> 
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be correlated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this open
> for now as this proposal is mainly about uAPI. For simplification
> purposes the two objects are kept separate in this context, assuming a
> 1:1 connection between them, with the domain as the placeholder
> representing the first-class object in the iommu ops.
> 
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
> 
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;

Again shouldn't the choice of a parent or child ioasid inform whether
there is a pasid, and if so which one?

> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev, 
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> 
> The pasid parameter is optionally provided based on the policy of the
> vfio device driver. It could be the PASID marking the default I/O address
> space for an mdev, the user-provided PASID marking a user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check the next section for a detailed explanation.
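> 
> For a pdev, for example, the vfio driver's VFIO_ATTACH_IOASID path might
> simply forward the user-provided value (sketch only; INVALID_PASID is an
> illustrative placeholder and the policies are detailed in section 4):
> 
> 	info = {
> 		.ioasid	= at_data.ioasid;
> 		.pasid	= (at_data.flag & IOASID_ATTACH_USER_PASID) ?
> 			  at_data.user_pasid : INVALID_PASID;	// pdev: vPASID==pPASID
> 	};
> 	ioasid_device_attach(&vdev->ioasid_dev, info);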
> 
> A new object is introduced and linked to ioasid_data->attach_data for 
> each successful attach operation:
> 
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}
> 
> As explained in the design section, there is no explicit group enforcement
> in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
> implicit group check - until every device within an iommu group has been
> attached to this IOASID, the already-attached devices in this group are
> kept in ioasid_data->partial_devices. The IOASID rejects any command
> while the partial_devices list is not empty.
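> 
> In other words, roughly (the errno choice is illustrative):
> 
> 	/* at the start of every /dev/ioasid command on this IOASID */
> 	if (!list_empty(&ioasid_data->partial_devices))
> 		return -EBUSY;	/* iommu group not fully attached yet */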
> 
> Finally, the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
> 		u32 ioasid, bool alloc);
> 
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share the same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in the next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). The vfio device driver calls this function
> (alloc=true) to get the global PASID for an ioasid before calling
> ioasid_device_attach. KVM also calls this function (alloc=false) to set up
> the PASID translation structure when the user calls KVM_MAP_PASID.
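> 
> A sketch of the two call sites just described (the surrounding code is
> illustrative; only the helper itself is part of this proposal):
> 
> 	/* vfio mdev driver, before attaching an ENQCMD-capable mdev */
> 	pasid = ioasid_get_global_pasid(ctx, ioasid, true /* alloc */);
> 	info = { .ioasid = ioasid; .pasid = pasid; };
> 	ioasid_device_attach(&vdev->ioasid_dev, info);
> 
> 	/* KVM, when handling KVM_MAP_PASID for {vpasid, ioasid_fd, ioasid} */
> 	ppasid = ioasid_get_global_pasid(ctx, ioasid, false /* no alloc */);
> 	/* then program vpasid->ppasid into the CPU PASID translation structure */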
> 
> 4. PASID Virtualization
> ------------------------------
> 
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark a GVA address space while pPASID is the one
> selected by the host and actually routed on the wire.
> 
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
> 
> The vfio device driver translates vPASID to pPASID before calling
> ioasid_device_attach, with two factors to be considered:
> 
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
>      should be instead converted to a newly-allocated one (vPASID!=
>      pPASID);
> 
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>      space or a global PASID space (implying sharing pPASID cross devices,
>      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>      as part of the process context);
> 
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
> 
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
> policies.)
> 
> 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
> 
>      vPASIDs are directly programmed by the guest into the assigned MMIO
>      bar, implying all DMAs out of this device carry a vPASID in the packet
>      header. This mandates vPASID==pPASID, in effect delegating the entire
>      per-RID PASID space to the guest.
> 
>      When ENQCMD is enabled, the CPU MSR when running a guest task
>      contains a vPASID. In this case the CPU PASID translation capability 
>      should be disabled so this vPASID in CPU MSR is directly sent to the
>      wire.
> 
>      This ensures consistent vPASID usage on the pdev regardless of whether
>      the workload is submitted through an MMIO register or the ENQCMD
>      instruction.
> 
> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> 
>      PASIDs are also used by the kernel to mark the default I/O address
>      space for an mdev, thus cannot be delegated to the guest. Instead, the
>      mdev driver must allocate a new pPASID for each vPASID (thus vPASID!=
>      pPASID) and then use the pPASID when attaching this mdev to an ioasid.
> 
>      The mdev driver needs to cache the PASID mapping so that in the
>      mediation path the vPASID programmed by the guest can be converted to
>      the pPASID before updating the physical MMIO register (see the sketch
>      after this list). The mapping should also be saved in the CPU PASID
>      translation structure (via KVM uAPI), so that when ENQCMD is enabled
>      the vPASID saved in the CPU MSR is auto-translated to the pPASID
>      before being sent to the wire.
> 
>      Generally pPASID could be allocated from the per-RID PASID space
>      if all mdev's created on the parent device don't support ENQCMD.
> 
>      However if the parent supports ENQCMD-capable mdevs, pPASIDs
>      must be allocated from a global pool because the CPU PASID
>      translation structure is per-VM. This implies that when a guest I/O
>      page table is attached to two mdevs with a single vPASID (i.e. bound
>      to the same guest process), the same pPASID should be used for
>      both mdevs even when they belong to different parents. Sharing a
>      pPASID across mdevs is achieved by calling the aforementioned
>      ioasid_get_global_pasid().
> 
> 3)  Mix pdev/mdev together
> 
>      The above policies are per device type and thus are not affected when
>      those device types are mixed (assigned to a single guest). However,
>      there is one exception - when both pdev and mdev support ENQCMD.
> 
>      Remember the two types have conflicting requirements on whether
>      CPU PASID translation should be enabled. This capability is per-VM,
>      and must be enabled for mdev isolation. When enabled, a pdev will
>      receive an mdev pPASID, violating its vPASID expectation.
> 
>      In a previous thread a PASID range split scheme was discussed to
>      support this combination, but we haven't worked out a clean uAPI
>      design yet. Therefore in this proposal we decide not to support it,
>      implying the user should have some intelligence to avoid such a
>      scenario. It could be a TODO task for the future.
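> 
> As an illustration of the mediation path in scenario 2) above (sketch
> only; lookup_ppasid() and program_hw_pasid() are placeholders for
> whatever cache structure and MMIO layout the mdev driver actually uses):
> 
> 	/* mdev driver, trapping a guest write of a vPASID to the device */
> 	ppasid = lookup_ppasid(mdev, vpasid);	// from the cached vPASID->pPASID map
> 	program_hw_pasid(mdev, ppasid);		// physical MMIO gets the pPASID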
> 
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
> 
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;
> 
> Regardless of the kernel policy, the user policy is unchanged:
> 
> -    provide a vPASID when calling VFIO_ATTACH_IOASID;
> -    call the KVM uAPI to set up CPU PASID translation for ENQCMD-capable
>      mdevs;
> -    don't expose the ENQCMD capability on both pdev and mdev;
> 
> Sample user flow is described in section 5.5.
> 
> 5. Use Cases and Flows
> -------------------------------
> 
> Here we assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio, so a device fd can be acquired without
> going through the legacy container/group interface. For illustration
> purposes those devices are just called dev[1...N]:
> 
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
for mdevs.  That leaves other subdirs of /dev/vfio free for future
non-PCI device types, and /dev/vfio itself for the legacy group
devices.

> As explained earlier, one IOASID fd is sufficient for all intended use cases:
> 
> 	ioasid_fd = open("/dev/ioasid", mode);
> 
> For simplicity the examples below are all written for the virtualization
> story. They are representative and could easily be adapted to a
> non-virtualization scenario.
> 
> Three types of IOASIDs are considered:
> 
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
> 
> At least one gpa_ioasid must always be created per guest, while the other
> two are only relevant when a vIOMMU is exposed to the guest.
> 
> The examples here apply to both pdev and mdev, unless explicitly marked
> otherwise (e.g. in section 5.5). The VFIO device driver in the kernel will
> figure out the associated routing information during the attach operation.
> 
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
> 
> 5.1. A simple example
> ++++++++++++++++++
> 
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
> 
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> If the guest is assigned more than dev1, the user follows the above
> sequence to attach the other devices to the same gpa_ioasid, i.e. sharing
> the GPA address space across all assigned devices.
> 
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
> 
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid.

Doesn't really affect your example, but note that the PAPR IOMMU does
not have a passthrough mode, so devices will not initially be attached
to gpa_ioasid - they will be unusable for DMA until attached to a
gIOVA ioasid.

> After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
> passthrough mode (gpa_ioasid).
> 
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (as VFIO
> works today).
> 
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
> 
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 	/* After boot, the guest enables a GIOVA space for dev2 */

Again, doesn't break the example, but this need not happen after guest
boot.  On the PAPR vIOMMU, the guest IOVA spaces (known as "logical IO
bus numbers" / liobns) and which devices are in each are fixed at
guest creation time and advertised to the guest via firmware.

> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> 
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel, instead of the user, that creates
> the shadow mapping.

In this case, I feel like the preregistration is redundant with the
GPA level mapping.  As long as the gIOVA mappings (which might be
frequent) can piggyback on the accounting done for the GPA mapping we
accomplish what we need from preregistration.

> The flow before the guest boots is the same as in 5.2, except for one
> point: because giova_ioasid is nested on gpa_ioasid, locked-page
> accounting is only done for gpa_ioasid, so it's not necessary to
> pre-register virtual memory.
> 
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before the guest boots):
> 
> 	/* After boot */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to 
> bind the guest IOVA page table with the IOMMU:
> 
> 	/* After boot */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> 	/* Invalidate IOTLB when required */
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		// granular information
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
> 
> 	/* See 5.6 for I/O page fault handling */
> 	
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
> 
> After boot the guest further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
> 
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
> 
> The sequence applies to all device types (pdev or mdev), except for
> one additional step to call KVM for ENQCMD-capable mdevs:
> 
> 	/* After boot */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);

I'm not clear what gva_ioasid is representing.  Is it representing a
single vPASID's address space, or a whole bunch of vPASIDs' address
spaces?

> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> 	...
> 
> 
> 5.6. I/O page fault
> +++++++++++++++
> 
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid, 
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates a user fault_data (ioasid, addr), links it
>     to the shared ring buffer and triggers the eventfd to userspace;
> 
> -   Upon receiving the event, Qemu needs to find the virtual routing
>     information (v_rid + v_pasid) of the device attached to the faulting
>     ioasid. If there are multiple, pick a random one. This should be fine
>     since the purpose is to fix the I/O page table in the guest;
> 
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>     carrying the virtual fault data (v_rid, v_pasid, addr);
> 
> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>     then sends a page response with virtual completion data (v_rid, v_pasid, 
>     response_code) to vIOMMU;
> 
> -   Qemu finds the pending fault event, converts virtual completion data 
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
>     complete the pending fault;
> 
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};
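> 
> A rough sketch of the Qemu side of this loop, in the pseudo-code style
> of the earlier examples (the uAPI is TBD, so the ring access, helper
> names and the completion ioctl below are all placeholders):
> 
> 	/* eventfd fires: drain the shared fault ring */
> 	read(ioasid_eventfd, &cnt, sizeof(cnt));
> 	while (next_fault(ring, &fault)) {	// fault = {ioasid, addr}
> 		vdev = pick_attached_device(fault.ioasid);	// any one if multiple
> 		viommu_inject_fault(vdev.v_rid, vdev.v_pasid, fault.addr);
> 	}
> 
> 	/* later, when the guest responds through the vIOMMU */
> 	complete = {
> 		.ioasid		= fault.ioasid;
> 		.response_code	= code;
> 	};
> 	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &complete);	// placeholder name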
> 
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
> 
> The PASID table is put in the GPA space on some platforms, thus must be
> updated by the guest. It is treated as another user page table to be bound
> to the IOMMU.
> 
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
> 
> vIOMMUs may include a caching mode (or a paravirtualized mechanism) which,
> once enabled, requires the guest to invalidate the PASID cache for any
> change to the PASID table. This allows Qemu to track the lifespan of guest
> I/O page tables.
> 
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
> 
> 	/* After boot */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);

I think this time pasidtbl_ioasid is representing multiple vPASID
address spaces, yes?  In which case I don't think it should be treated
as the same sort of object as a normal IOASID, which represents a
single address space IIUC.

> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> 
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, it just
> 	  * tracks the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Hrm.. if you still have to individually bind a table for each vPASID,
what's the point of BIND_PASID_TABLE?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
@ 2021-06-02  6:15   ` David Gibson
  0 siblings, 0 replies; 518+ messages in thread
From: David Gibson @ 2021-06-02  6:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, Jason Gunthorpe, Jiang, Dave,
	David Woodhouse, Jason Wang



On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the writeup.  I'm giving this a first pass review, note
that I haven't read all the existing replies in detail yet.

> 
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
>     2.1. /dev/ioasid uAPI
>     2.2. /dev/vfio uAPI
>     2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
>     5.1. A simple example
>     5.2. Multiple IOASIDs (no nesting)
>     5.3. IOASID nesting (software)
>     5.4. IOASID nesting (hardware)
>     5.5. Guest SVA (vSVA)
>     5.6. I/O page fault
>     5.7. BIND_PASID_TABLE
> ====
> 
> 1. Terminologies and Concepts
> -----------------------------------------
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).

Is there a compelling reason to have all the IOASIDs handled by one
FD?  Simply on the grounds that handles to kernel internal objects are
usually fds, having an fd per ioasid seems like an obvious alternative.
In that case plain open() would replace IOASID_ALLOC.  Nested could be
handled either by 1) having a CREATED_NESTED on the parent fd which
spawns a new fd or 2) opening /dev/ioasid again for a new fd and doing
a SET_PARENT before doing anything else.

I may be bikeshedding here..

> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind an user-managed I/O page table to the 
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASID's, implying 
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.

To clarify, I'm guessing that's a restriction of likely practice,
rather than a fundamental API restriction.  I can see a couple of
theoretical future cases where a user-managed pagetable for a "base"
IOASID would be feasible:

  1) On some fancy future MMU allowing free nesting, where the kernel
     would insert an implicit extra layer translating user addresses
     to physical addresses, and the userspace manages a pagetable with
     its own VAs being the target AS
  2) For a purely software virtual device, where its virtual DMA
     engine can interpret user addresses fine

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

As Jason also said, I don't think you need to restrict software
nesting to only kernel managed L2 tables - you already need hooks for
cache invalidation, and you can use those to trigger reshadows.

> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/ioasid context always refers to a 
> physical one or 'pdev' (PF or VF). 

What you mean by "physical" device here isn't really clear - VFs
aren't really physical devices, and the PF/VF terminology also doesn't
extend to non-PCI devices (which I think we want to consider for the
API, even if we're not implementing it any time soon).

Now, it's clear that we can't program things into the IOMMU before
attaching a device - we might not even know which IOMMU to use.
However, I'm not sure if it's wise to automatically make the AS "real"
as soon as we attach a device:

 * If we're going to attach a whole bunch of devices, could we (for at
   least some IOMMU models) end up doing a lot of work which then has
   to be re-done for each extra device we attach?
   
 * With kernel managed IO page tables could attaching a second device
   (at least on some IOMMU models) require some operation which would
   require discarding those tables?  e.g. if the second device somehow
   forces a different IO page size

For that reason I wonder if we want some sort of explicit enable or
activate call.  Device attaches would only be valid before, map or
attach pagetable calls would only be valid after.

> One I/O address space could be attached to multiple devices. In this case, 
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> 
> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I'm not really clear on how this interacts with nested ioasids.  Would
you generally expect the RID+PASID IOASes to be children of the base
RID IOAS, or not?

If the PASID ASes are children of the RID AS, can we consider this not
as the device explicitly attaching to multiple IOASIDs, but instead
attaching to the parent IOASID with awareness of the child ones?

> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying 
> the routing information and registering it to the ioasid driver when calling 
> ioasid attach helper function. It could be RID if the assigned device is 
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
> user might also provide its view of virtual routing information (vPASID) in 
> the attach call, e.g. when multiple user-managed I/O address spaces are 
> attached to the vfio_device. In this case VFIO must figure out whether 
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
> 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.
> 
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 
> 
> Modern devices may support a scalable workload submission interface 
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having 
> PASID saved in the CPU MSR and carried in the instruction payload 
> when sent out to the device. Then a single work queue shared by 
> multiple processes can compose DMAs carrying different PASIDs. 

Is the assumption here that the processes share the IOASID FD
instance, but not memory?

> When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability 
> for auto-conversion in the fast path. The user is expected to setup the 
> PASID mapping through KVM uAPI, with information about {vpasid, 
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
> to figure out the actual pPASID given an IOASID.
> 
> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For 
> example, I/O page fault is always reported to userspace per IOASID, 
> although it's physically reported per device (RID+PASID). If there is a 
> need of further relaying this fault into the guest, the user is responsible 
> of identifying the device attached to this IOASID (randomly pick one if 
> multiple attached devices) and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space.

Do we need to consider two management modes here, much as we have for
the pagetables themselves: either kernel managed, in which we have
explicit calls to bind a vPASID to a parent PASID, or user managed in
which case we register a table in some format.

> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who 
> actually writes the PASID table). One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. However this way significantly 
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device. 
> 
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
> device notation in this interface as aforementioned. But the ioasid driver 
> does implicit check to make sure that devices within an iommu group 
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to 
> the user.

An explicit ENABLE call might make this checking simpler.

> There was a long debate in previous discussion whether VFIO should keep 
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
> a simplified model where every device bound to VFIO is explicitly listed 
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for 
> understanding the group topology and meeting the implicit group check 
> criteria enforced in /dev/ioasid. The use case examples in this proposal 
> are based on the new model.
> 
> Of course for backward compatibility VFIO still needs to keep the existing 
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
> iommu ops to internal ioasid helper functions.
> 
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.
> 
> -   PPC has not be considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.

From what I've seen so far, it seems ok to me.  Note that at this
stage I'm only familiar with existing PPC IOMMUs, which don't have
PASID or anything similar.  I'm not sure what IBM's future plans are
for IOMMUs, so there will be more checking to be done.

> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

I think that's only used on PPC, as an optimization for PAPR's
paravirt IOMMU with a small default IOVA window.  I think we can do
something equivalent for IOASIDs from what I've seen so far.

> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
>     which can be physically isolated in-between through PASID-granular
>     IOMMU protection. Historically people also discussed one usage by 
>     mediating a pdev into a mdev. This usage is not covered here, and is 
>     supposed to be replaced by Max's work which allows overriding various 
>     VFIO operations in vfio-pci driver.

I think there are a couple of different mdev cases, so we'll need to
be careful of that and clarify our terminology a bit, I think.

> 2. uAPI Proposal
> ----------------------
> 
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
> 
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
> 
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
> 
> 
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> 
> /*
>   * Check whether an uAPI extension is supported. 
>   *
>   * This is for FD-level capabilities, such as locked page pre-registration. 
>   * IOASID-level capabilities are reported through IOASID_GET_INFO.
>   *
>   * Return: 0 if not supported, 1 if supported.
>   */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
> 
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)

AIUI PPC is the main user of the current pre-registration API, though
it could have value in any vIOMMU case to avoid possibly costly
accounting on every guest map/unmap.

I wonder if there's a way to model this using a nested AS rather than
requiring special operations.  e.g.

	'prereg' IOAS
	|
	\- 'rid' IOAS
	   |
	   \- 'pasid' IOAS (maybe)

'prereg' would have a kernel managed pagetable into which (for
example) qemu platform code would map all guest memory (using
IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
IO mappings into the 'rid' IOAS in terms of GPA.

This wouldn't quite work as is, because the 'prereg' IOAS would have
no devices.  But we could potentially have another call to mark an
IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
would be an alternative to attaching devices.
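
Roughly, in the pseudo-code style of section 5 (the "marking" ioctl name
below is invented here, purely hypothetical):

	prereg_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	/* hypothetical: mark it as a pure "preregistration" IOAS, no devices */
	ioctl(ioasid_fd, IOASID_SET_PREREG_ONLY, prereg_ioasid);

	/* qemu platform code maps all guest RAM into it */
	dma_map = {
		.ioasid	= prereg_ioasid;
		.iova	= 0;			// GPA
		.vaddr	= guest_ram_hva;
		.size	= guest_ram_size;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

	/* the vIOMMU driver then mirrors guest IO mappings, in GPA terms,
	 * into a 'rid' IOAS nested under it
	 */
	rid_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, prereg_ioasid);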

> /*
>   * Allocate an IOASID. 
>   *
>   * IOASID is the FD-local software handle representing an I/O address 
>   * space. Each IOASID is associated with a single I/O page table. User 
>   * must call this ioctl to get an IOASID for every I/O address space that is
>   * intended to be enabled in the IOMMU.
>   *
>   * A newly-created IOASID doesn't accept any command before it is 
>   * attached to a device. Once attached, an empty I/O page table is 
>   * bound with the IOMMU then the user could use either DMA mapping 
>   * or pgtable binding commands to manage this I/O page table.
>   *
>   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>   *
>   * Return: allocated ioasid on success, -errno on failure.
>   */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> 
> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);

Can I request we represent this in terms of permitted IOVA ranges,
rather than reserved IOVA ranges.  This works better with the "window"
model I have in mind for unifying the restrictions of the POWER IOMMU
with Type1 like mapping.

>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *
>   * Output parameters:
>   *	- many. TBD.
>   */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
> 
> 
> /*
>   * Map/unmap process virtual addresses to I/O virtual addresses.
>   *
>   * Provide VFIO type1 equivalent semantics. Start with the same 
>   * restriction e.g. the unmap size should match those used in the 
>   * original mapping call. 
>   *
>   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>   * must be already in the preregistered list.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *	- refer to vfio_iommu_type1_dma_{un}map
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)

I'm assuming these would be expected to fail if a user managed
pagetable has been bound?

> /*
>   * Create a nesting IOASID (child) on an existing IOASID (parent)
>   *
>   * IOASIDs can be nested together, implying that the output address 
>   * from one I/O page table (child) must be further translated by 
>   * another I/O page table (parent).
>   *
>   * As the child adds essentially another reference to the I/O page table 
>   * represented by the parent, any device attached to the child ioasid 
>   * must be already attached to the parent.
>   *
>   * In concept there is no limit on the number of the nesting levels. 
>   * However for the majority case one nesting level is sufficient. The
>   * user should check whether an IOASID supports nesting through 
>   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>   * the nesting capability is reported only on the parent instead of the
>   * child.
>   *
>   * User also needs check (via IOASID_GET_INFO) whether the nesting 
>   * is implemented in hardware or software. If software-based, DMA 
>   * mapping protocol should be used on the child IOASID. Otherwise, 
>   * the child should be operated with pgtable binding protocol.
>   *
>   * Input parameters:
>   *	- u32 parent_ioasid;
>   *
>   * Return: child_ioasid on success, -errno on failure;
>   */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
> 
> 
> /*
>   * Bind an user-managed I/O page table with the IOMMU
>   *
>   * Because user page table is untrusted, IOASID nesting must be enabled 
>   * for this ioasid so the kernel can enforce its DMA isolation policy 
>   * through the parent ioasid.
>   *
>   * Pgtable binding protocol is different from DMA mapping. The latter 
>   * has the I/O page table constructed by the kernel and updated 
>   * according to user MAP/UNMAP commands. With pgtable binding the 
>   * whole page table is created and updated by userspace, thus different 
>   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>   *
>   * Because the page table is directly walked by the IOMMU, the user 
>   * must  use a format compatible to the underlying hardware. It can 
>   * check the format information through IOASID_GET_INFO.
>   *
>   * The page table is bound to the IOMMU according to the routing 
>   * information of each attached device under the specified IOASID. The
>   * routing information (RID and optional PASID) is registered when a 
>   * device is attached to this IOASID through VFIO uAPI. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of the user page table;
>   *	- formats (vendor, address_width, etc.);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)

I'm assuming that UNBIND would return the IOASID to a kernel-managed
pagetable?

For debugging and certain hypervisor edge cases it might be useful to
have a call to allow userspace to lookup and specific IOVA in a guest
managed pgtable.


> /*
>   * Bind an user-managed PASID table to the IOMMU
>   *
>   * This is required for platforms which place PASID table in the GPA space.
>   * In this case the specified IOASID represents the per-RID PASID space.
>   *
>   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>   * special flag to indicate the difference from normal I/O address spaces.
>   *
>   * The format info of the PASID table is reported in IOASID_GET_INFO.
>   *
>   * As explained in the design section, user-managed I/O page tables must
>   * be explicitly bound to the kernel even on these platforms. It allows
>   * the kernel to uniformly manage I/O address spaces cross all platforms.
>   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>   * to carry device routing information to indirectly mark the hidden I/O
>   * address spaces.
>   *
>   * Input parameters:
>   *	- child_ioasid;

Wouldn't this be the parent ioasid, rather than one of the potentially
many child ioasids?

>   *	- address of PASID table;
>   *	- formats (vendor, size, etc.);
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
> 
> 
> /*
>   * Invalidate IOTLB for an user-managed I/O page table
>   *
>   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
>   * doesn't allow the user to specify cache type and likely support only
>   * two granularities (all, or a specified range) in the I/O address space.
>   *
>   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
>   * cache). If the IOASID represents an I/O address space, the invalidation
>   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>   * represents a vPASID space, then this command applies to the PASID
>   * cache.
>   *
>   * Similarly this command doesn't provide IOMMU-like granularity
>   * info (domain-wide, pasid-wide, range-based), since it's all about the
>   * I/O address space itself. The ioasid driver walks the attached
>   * routing information to match the IOMMU semantics under the
>   * hood. 
>   *
>   * Input parameters:
>   *	- child_ioasid;

And couldn't this be be any ioasid, not just a child one, depending on
whether you want PASID scope or RID scope invalidation?

>   *	- granularity
>   * 
>   * Return: 0 on success, -errno on failure
>   */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
> 
> 
> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */
> 
> 
> /*
>   * Dirty page tracking 
>   *
>   * Track and report memory pages dirtied in I/O address spaces. There 
>   * is an ongoing work by Kunkun Jiang by extending existing VFIO type1. 
>   * It needs be adapted to /dev/ioasid later.
>   */
> 
> 
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
> 
> /*
>   * Bind a vfio_device to the specified IOASID fd
>   *
>   * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
>   * vfio device should not be bound to multiple ioasid_fd's. 
>   *
>   * Input parameters:
>   *	- ioasid_fd;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> 
> /*
>   * Attach a vfio device to the specified IOASID
>   *
>   * Multiple vfio devices can be attached to the same IOASID, and vice 
>   * versa. 
>   *
>   * User may optionally provide a "virtual PASID" to mark an I/O page 
>   * table on this vfio device. Whether the virtual PASID is physically used 
>   * or converted to another kernel-allocated PASID is a policy in vfio device 
>   * driver.
>   *
>   * There is no need to specify ioasid_fd in this call due to the assumption 
>   * of 1:1 connection between vfio device and the bound fd.
>   *
>   * Input parameter:
>   *	- ioasid;
>   *	- flag;
>   *	- user_pasid (if specified);

Wouldn't the PASID be communicated by whether you give a parent or
child ioasid, rather than needing an extra value?

>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
> 
> 
> 2.3. KVM uAPI
> ++++++++++++
> 
> /*
>   * Update CPU PASID mapping
>   *
>   * This is necessary when ENQCMD will be used in the guest while the
>   * targeted device doesn't accept the vPASID saved in the CPU MSR.
>   *
>   * This command allows user to set/clear the vPASID->pPASID mapping
>   * in the CPU, by providing the IOASID (and FD) information representing
>   * the I/O address space marked by this vPASID.
>   *
>   * Input parameters:
>   *	- user_pasid;
>   *	- ioasid_fd;
>   *	- ioasid;
>   */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
> 
> 
> 3. Sample structures and helper functions
> --------------------------------------------------------
> 
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> 
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
> 
> An ioasid_ctx is created for each fd:
> 
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;
> 		// a list of registered devices
> 		struct list_head		dev_list;
> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;
> 	};
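
One plausible shape for ioasid_ctx_fdget(), assuming /dev/ioasid is a chardev
whose file->private_data points at the ioasid_ctx and that the ctx grows a
refcount; neither detail is specified in the proposal:

	struct ioasid_ctx *ioasid_ctx_fdget(int fd)
	{
		struct fd f = fdget(fd);
		struct ioasid_ctx *ctx = NULL;

		if (!f.file)
			return NULL;
		if (f.file->f_op == &ioasid_fops) {	/* is this really an ioasid fd? */
			ctx = f.file->private_data;
			refcount_inc(&ctx->ref);	/* assumed refcount field */
		}
		fdput(f);
		return ctx;
	}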
> 
> Each registered device is represented by ioasid_dev:
> 
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device

Again "physical" isn't really clearly defined here.

> 		struct device 		*device;
> 		struct kref		kref;
> 	};
> 
> Because we assume one vfio_device is connected to at most one ioasid_fd, 
> ioasid_dev could be embedded in vfio_device and then linked to 
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. The PASID marking this
> mdev is specified later via VFIO_ATTACH_IOASID.
> 
> An ioasid_data is created on IOASID_ALLOC, as the main object 
> describing the characteristics of an I/O page table:
> 
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
> 
> 		// the IOASID number
> 		u32			ioasid;
> 
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;
> 
> 		// map metadata (vfio type1 semantics)
> 		struct rb_node		dma_list;

Why do you need this?  Can't you just store the kernel managed
mappings in the host IO pgtable?

> 		// pointer to user-managed pgtable (for nesting case)
> 		u64			user_pgd;

> 		// link to the parent ioasid (for nesting)
> 		struct ioasid_data	*parent;
> 
> 		// cache the global PASID shared by ENQCMD-capable
> 		// devices (see below explanation in section 4)
> 		u32			pasid;
> 
> 		// a list of device attach data (routing information)
> 		struct list_head		attach_data;
> 
> 		// a list of partially-attached devices (group)
> 		struct list_head		partial_devices;
> 
> 		// a list of fault_data reported from the iommu layer
> 		struct list_head		fault_data;
> 
> 		...
> 	}
> 
> ioasid_data and iommu_domain have overlapping roles as both are 
> introduced to represent an I/O address space. It is still a big TBD how 
> the two should be correlated or even merged, and whether new iommu 
> ops are required to handle RID+PASID explicitly. We leave this open 
> for now as this proposal is mainly about uAPI. For simplification 
> purposes the two objects are kept separate in this context, assuming a 
> 1:1 connection in-between with the domain as the placeholder 
> representing the 1st class object in the iommu ops. 
> 
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
> 
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;

Again shouldn't the choice of a parent or child ioasid inform whether
there is a pasid, and if so which one?

> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev, 
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> 
> The pasid parameter is optionally provided based on the policy in the vfio
> device driver. It could be the PASID marking the default I/O address 
> space for an mdev, the user-provided PASID marking a user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check the next section for a detailed explanation.
> 
> A new object is introduced and linked to ioasid_data->attach_data for 
> each successful attach operation:
> 
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}
> 
> As explained in the design section, there is no explicit group enforcement
> in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
> implicit group check - until every device within an iommu group is 
> attached to this IOASID, the previously-attached devices in this group are
> kept in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.
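
As a tiny sketch of the guard this implies (the helper name is illustrative),
every map/bind/invalidate command on the IOASID would bail out while a group
is only partially attached:

	static int ioasid_cmd_allowed(struct ioasid_data *data)
	{
		/* reject commands until all devices of the group are attached */
		return list_empty(&data->partial_devices) ? 0 : -EBUSY;
	}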
> 
> Finally, the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
> 		u32 ioasid, bool alloc);
> 
> ioasid_get_global_pasid is necessary in scenarios where multiple devices 
> want to share the same PASID value on the attached I/O page table (e.g. 
> when ENQCMD is enabled, as explained in the next section). We need a 
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). The vfio device driver calls this function 
> (alloc=true) to get the global PASID for an ioasid before calling 
> ioasid_device_attach. KVM also calls this function (alloc=false) to set up 
> the PASID translation structure when the user calls KVM_MAP_PASID.
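
For example, the attach path of an ENQCMD-capable mdev driver might look
roughly like this; the function name and the surrounding driver plumbing are
assumptions, only the two ioasid helpers come from the proposal:

	int mdev_attach_ioasid(struct ioasid_dev *idev, struct ioasid_ctx *ctx,
			       u32 ioasid, u32 vpasid)
	{
		struct attach_info info = { .ioasid = ioasid };

		/* allocate or look up the pPASID shared by all ENQCMD-capable
		 * mdevs attached to this I/O page table */
		info.pasid = ioasid_get_global_pasid(ctx, ioasid, true);

		/* the driver also records vpasid -> info.pasid for use in the
		 * mediation path (see section 4) */
		return ioasid_device_attach(idev, info);
	}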
> 
> 4. PASID Virtualization
> ------------------------------
> 
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
> created on the assigned vfio device. This leads to the concepts of 
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned 
> by the guest to mark a GVA address space while pPASID is the one 
> selected by the host and actually routed on the wire.
> 
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
> 
> The vfio device driver translates vPASID to pPASID before calling 
> ioasid_device_attach, with two factors to be considered:
> 
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
>      should be instead converted to a newly-allocated one (vPASID!=
>      pPASID);
> 
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>      space or a global PASID space (implying sharing pPASID cross devices,
>      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>      as part of the process context);
> 
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
> 
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
> policies.)
> 
> 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
> 
>      vPASIDs are directly programmed by the guest into the assigned MMIO 
>      bar, implying all DMAs out of this device carry a vPASID in the packet 
>      header. This mandates vPASID==pPASID, sort of delegating the entire 
>      per-RID PASID space to the guest.
> 
>      When ENQCMD is enabled, the CPU MSR when running a guest task
>      contains a vPASID. In this case the CPU PASID translation capability 
>      should be disabled so this vPASID in CPU MSR is directly sent to the
>      wire.
> 
>      This ensures consistent vPASID usage on pdev regardless of whether the 
>      workload is submitted through an MMIO register or the ENQCMD instruction.
> 
> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> 
>      PASIDs are also used by the kernel to mark the default I/O address space 
>      for an mdev, thus cannot be delegated to the guest. Instead, the mdev 
>      driver must allocate a new pPASID for each vPASID (thus vPASID!=
>      pPASID) and then use the pPASID when attaching this mdev to an ioasid.
> 
>      The mdev driver needs to cache the PASID mapping so that in the 
>      mediation path the vPASID programmed by the guest can be converted 
>      to pPASID before updating the physical MMIO register. The mapping 
>      should also be saved in the CPU PASID translation structure (via the 
>      KVM uAPI), so the vPASID saved in the CPU MSR is auto-translated to 
>      pPASID before being sent to the wire, when ENQCMD is enabled. 
> 
>      Generally pPASID could be allocated from the per-RID PASID space
>      if all mdev's created on the parent device don't support ENQCMD.
> 
>      However, if the parent supports ENQCMD-capable mdevs, pPASIDs
>      must be allocated from a global pool because the CPU PASID 
>      translation structure is per-VM. This implies that when a guest I/O 
>      page table is attached to two mdevs with a single vPASID (i.e. bound 
>      to the same guest process), the same pPASID should be used for 
>      both mdevs even when they belong to different parents. Sharing a
>      pPASID across mdevs is achieved by calling the aforementioned 
>      ioasid_get_global_pasid().
> 
> 3)  Mix pdev/mdev together
> 
>      The above policies are per device type and thus are not affected when 
>      mixing those device types together (when assigned to a single guest). 
>      However, there is one exception - when both pdev and mdev support ENQCMD.
> 
>      Remember the two types have conflicting requirements on whether 
>      CPU PASID translation should be enabled. This capability is per-VM, 
>      and must be enabled for mdev isolation. When enabled, a pdev will 
>      receive an mdev pPASID, violating its vPASID expectation.
> 
>      In a previous thread a PASID range split scheme was discussed to support
>      this combination, but we haven't worked out a clean uAPI design yet.
>      Therefore in this proposal we decide not to support it, implying the 
>      user should have some intelligence to avoid such a scenario. It could be
>      a TODO task for the future.
> 
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
> 
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;
> 
> Regardless of the kernel policy, the user policy is unchanged:
> 
> -    provide vPASID when calling VFIO_ATTACH_IOASID;
> -    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> -    Don't expose ENQCMD capability on both pdev and mdev;
> 
> Sample user flow is described in section 5.5.
> 
> 5. Use Cases and Flows
> -------------------------------
> 
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
> 
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
for mdevs.  That leaves other subdirs of /dev/vfio free for future
non-PCI device types, and /dev/vfio itself for the legacy group
devices.

> As explained earlier, one IOASID fd is sufficient for all intended use cases:
> 
> 	ioasid_fd = open("/dev/ioasid", mode);
> 
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
> 
> Three types of IOASIDs are considered:
> 
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
> 
> At least one gpa_ioasid must always be created per guest, while the other 
> two are relevant as far as vIOMMU is concerned.
> 
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the 
> associated routing information in the attaching operation.
> 
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
> 
> 5.1. A simple example
> ++++++++++++++++++
> 
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
> 
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> If the guest is assigned more than dev1, the user follows the above sequence
> to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA 
> address space across all assigned devices.
> 
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
> 
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid.

Doesn't really affect your example, but note that the PAPR IOMMU does
not have a passthrough mode, so devices will not initially be attached
to gpa_ioasid - they will be unusable for DMA until attached to a
gIOVA ioasid.

> After boot the guest creates 
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
> passthrough mode (gpa_ioasid).
> 
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (as VFIO
> works today).
> 
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
> 
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 	/* After boot, guest enables a GIOVA space for dev2 */

Again, doesn't break the example, but this need not happen after guest
boot.  On the PAPR vIOMMU, the guest IOVA spaces (known as "logical IO
bus numbers" / liobns) and which devices are in each are fixed at
guest creation time and advertised to the guest via firmware.

> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> 
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with software-based IOASID nesting 
> available. In this mode it is the kernel, instead of the user, that creates
> the shadow mapping.

In this case, I feel like the preregistration is redundant with the
GPA level mapping.  As long as the gIOVA mappings (which might be
frequent) can piggyback on the accounting done for the GPA mapping we
accomplish what we need from preregistration.

> The flow before the guest boots is the same as in 5.2, except for one point. 
> Because giova_ioasid is nested on gpa_ioasid, locked-page accounting is 
> only conducted for gpa_ioasid, so it's not necessary to pre-register virtual 
> memory.
> 
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before the guest boots):
> 
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to 
> bind the guest IOVA page table with the IOMMU:
> 
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> 	/* Invalidate IOTLB when required */
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		// granular information
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
> 
> 	/* See 5.6 for I/O page fault handling */
> 	
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
> 
> After boot the guest further creates a GVA address space (gpasid1) on 
> dev1. Dev2 is not affected (still attached to giova_ioasid).
> 
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
> 
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:
> 
> 	/* After boots */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);

I'm not clear on what gva_ioasid is representing.  Is it representing a
single vPASID's address space, or a whole bunch of vPASIDs' address
spaces?

> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> 	...
> 
> 
> 5.6. I/O page fault
> +++++++++++++++
> 
> (uAPI is TBD. This is just the high-level flow from the host IOMMU driver
> to the guest IOMMU driver and back.)
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid, 
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates a user fault_data (ioasid, addr), links it 
>     to the shared ring buffer and triggers the eventfd to userspace;
> 
> -   Upon receiving the event, Qemu needs to find the virtual routing information 
>     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
>     multiple, pick a random one. This should be fine since the purpose is to
>     fix the I/O page table on the guest;
> 
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>     carrying the virtual fault data (v_rid, v_pasid, addr);
> 
> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>     then sends a page response with virtual completion data (v_rid, v_pasid, 
>     response_code) to vIOMMU;
> 
> -   Qemu finds the pending fault event, converts virtual completion data 
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
>     complete the pending fault;
> 
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};
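
The uAPI for this flow is explicitly TBD above, so the following is only an
illustrative sketch of the Qemu side; the record layouts, the fault_ring_pop()
helper and the IOASID_PAGE_RESPONSE ioctl are invented names, not part of the
proposal:

	/* made-up layouts mirroring the flow above */
	struct ioasid_fault    { __u32 ioasid; __u64 addr; };
	struct ioasid_response { __u32 ioasid; __u64 addr; __u32 response_code; };

	void handle_io_page_faults(int ioasid_fd, int event_fd)
	{
		uint64_t cnt;
		struct ioasid_fault evt;

		while (read(event_fd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
			while (fault_ring_pop(&evt)) {	/* shared ring buffer */
				/* look up the virtual routing info (v_rid, v_pasid)
				 * of a device attached to evt.ioasid, inject a vIOMMU
				 * fault, and once the guest has responded complete
				 * the pending fault: */
				struct ioasid_response rsp = {
					.ioasid		= evt.ioasid,
					.addr		= evt.addr,
					.response_code	= 0,	/* success */
				};
				ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &rsp);
			}
		}
	}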
> 
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
> 
> The PASID table is put in the GPA space on some platforms, and thus must be
> updated by the guest. It is treated as another user page table to be bound 
> with the IOMMU.
> 
> As explained earlier, the user still needs to explicitly bind every user I/O 
> page table to the kernel so the same pgtable binding protocol (bind, cache 
> invalidate and fault handling) is unified across platforms.
> 
> vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> enabled, requires the guest to invalidate PASID cache for any change on the 
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> 
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
> 
> 	/* After boots */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);

I think this time pasidtbl_ioasid is representing multiple vPASID
address spaces, yes?  In which case I don't think it should be treated
as the same sort of object as a normal IOASID, which represents a
single address space IIUC.

> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> 
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Hrm.. if you still have to individually bind a table for each vPASID,
what's the point of BIND_PASID_TABLE?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 17:35   ` Jason Gunthorpe
@ 2021-06-02  6:32     ` David Gibson
  -1 siblings, 0 replies; 518+ messages in thread
From: David Gibson @ 2021-06-02  6:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Fri, May 28, 2021 at 02:35:38PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
[snip]
> > With above design /dev/ioasid uAPI is all about I/O address spaces. 
> > It doesn't include any device routing information, which is only 
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). 
> 
> I agree with Jean-Philippe - at the very least erasing this
> information needs a major rational - but I don't really see why it
> must be erased? The HW reports the originating device, is it just a
> matter of labeling the devices attached to the /dev/ioasid FD so it
> can be reported to userspace?

HW reports the originating device as far as it knows.  In many cases
where you have multiple devices in an IOMMU group, it's because
although they're treated as separate devices at the kernel level, they
have the same RID at the HW level.  Which means a RID for something in
the right group is the closest you can count on supplying.

[snip]
> > However this way significantly 
> > violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> > one address space any more. Device routing information (indirectly 
> > marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> > page faulting uAPI to help connect vIOMMU with the underlying 
> > pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> I'm confused by this rational.
> 
> For a vIOMMU that has IO page tables in the guest the basic
> choices are:
>  - Do we have a hypervisor trap to bind the page table or not? (RID
>    and PASID may differ here)
>  - Do we have a hypervisor trap to invaliate the page tables or not?
> 
> If the first is a hypervisor trap then I agree it makes sense to create a
> child IOASID that points to each guest page table and manage it
> directly. This should not require walking guest page tables as it is
> really just informing the HW where the page table lives. HW will walk
> them.
> 
> If there are no hypervisor traps (does this exist?) then there is no
> way to involve the hypervisor here and the child IOASID should simply
> be a pointer to the guest's data structure that describes binding. In
> this case that IOASID should claim all PASIDs when bound to a
> RID. 

And in that case I think we should call that object something other
than an IOASID, since it represents multiple address spaces.

> Invalidation should be passed up the to the IOMMU driver in terms of
> the guest tables information and either the HW or software has to walk
> to guest tables to make sense of it.
> 
> Events from the IOMMU to userspace should be tagged with the attached
> device label and the PASID/substream ID. This means there is no issue
> to have a a 'all PASID' IOASID.
> 
> > Notes:
> > -   It might be confusing as IOASID is also used in the kernel (drivers/
> >     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> >     find a better name later to differentiate.
> 
> +1 on Jean-Philippe's remarks
> 
> > -   PPC has not be considered yet as we haven't got time to fully understand
> >     its semantics. According to previous discussion there is some generality 
> >     between PPC window-based scheme and VFIO type1 semantics. Let's 
> >     first make consensus on this proposal and then further discuss how to 
> >     extend it to cover PPC's requirement.
> 
> From what I understood PPC is not so bad, Nesting IOASID's did its
> preload feature and it needed a way to specify/query the IOVA range a
> IOASID will cover.
> 
> > -   There is a protocol between vfio group and kvm. Needs to think about
> >     how it will be affected following this proposal.
> 
> Ugh, I always stop looking when I reach that boundary. Can anyone
> summarize what is going on there?
> 
> Most likely passing the /dev/ioasid into KVM's FD (or vicevera) is the
> right answer. Eg if ARM needs to get the VMID from KVM and set it to
> ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
> reasonable. Certainly better than the symbol get sutff we have right
> now.
> 
> I will read through the detail below in another email
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 19:58   ` Jason Gunthorpe
@ 2021-06-02  6:48     ` David Gibson
  -1 siblings, 0 replies; 518+ messages in thread
From: David Gibson @ 2021-06-02  6:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Fri, May 28, 2021 at 04:58:39PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > 
> > 5. Use Cases and Flows
> > 
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> > 
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > 
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > 
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Leaving aside whether group fds should exist, while they *do* exist
binding to an IOASID should be done on the group not an individual
device.

[snip]
> > 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> > 	  * translation structure through KVM
> > 	  */
> > 	pa_data = {
> > 		.ioasid_fd	= ioasid_fd;
> > 		.ioasid		= gva_ioasid;
> > 		.guest_pasid	= gpasid1;
> > 	};
> > 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Make sense
> 
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I'm pretty sure the device(s) could matter, although they probably
won't usually.  But it would certainly be possible for a system to
have two different host bridges with two different IOMMUs with
different pagetable formats.  Until you know which devices (and
therefore which host bridge) you're talking about, you don't know what
formats of pagetable to accept.  And if you have devices from *both*
bridges you can't bind a page table at all - you could theoretically
support a kernel managed pagetable by mirroring each MAP and UNMAP to
tables in both formats, but it would be pretty reasonable not to
support that.
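
To illustrate the point, a bind of a user page table would effectively have
to check the format against the IOMMU behind every already-attached device;
the format-query helper below is an assumption, not proposed API:

	static int ioasid_check_pgtable_format(struct ioasid_data *data, u32 format)
	{
		struct ioasid_attach_data *ad;

		list_for_each_entry(ad, &data->attach_data, next)
			if (!iommu_hw_supports_format(ad->dev->device, format))
				return -EOPNOTSUPP;	/* e.g. a device behind
							 * the other host bridge */
		return 0;
	}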

> > 5.6. I/O page fault
> > +++++++++++++++
> > 
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> > 
> > -   Host IOMMU driver receives a page request with raw fault_data {rid, 
> >     pasid, addr};
> > 
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> >     information registered by IOASID fault handler;
> > 
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
> >     is saved in ioasid_data->fault_data (used for response);
> > 
> > -   IOASID fault handler generates an user fault_data (ioasid, addr), links it 
> >     to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

I like the idea of labelling devices when they're attached; it makes
extension to non-PCI devices much more obvious than having to deal
with concrete RIDs.

But, remember we can only (reliably) determine rid up to the group
boundary.  So if you're labelling devices, all devices in a group
would have to have the same label.  Or you attach the label to a group
not a device, which would be a reason to represent the group as an
object again.

> > -   Upon received event, Qemu needs to find the virtual routing information 
> >     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
> >     multiple, pick a random one. This should be fine since the purpose is to
> >     fix the I/O page table on the guest;
> 
> The device label should fix this
>  
> > -   Qemu finds the pending fault event, converts virtual completion data 
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
> >     complete the pending fault;
> > 
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
> >     ioasid_data->fault_data, and then calls iommu api to complete it with
> >     {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> > 
> > PASID table is put in the GPA space on some platform, thus must be updated
> > by the guest. It is treated as another user page table to be bound with the 
> > IOMMU.
> > 
> > As explained earlier, the user still needs to explicitly bind every user I/O 
> > page table to the kernel so the same pgtable binding protocol (bind, cache 
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> > enabled, requires the guest to invalidate PASID cache for any change on the 
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> > 
> > 	/* After boots */
> > 	/* Make vPASID space nested on GPA space */
> > 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> > 
> > 	/* Attach dev1 to pasidtbl_ioasid */
> > 	at_data = { .ioasid = pasidtbl_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 
> > 	/* Bind PASID table */
> > 	bind_data = {
> > 		.ioasid	= pasidtbl_ioasid;
> > 		.addr	= gpa_pasid_table;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> > 
> > 	/* vIOMMU detects a new GVA I/O space created */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> > 
> > 	/* Attach dev1 to the new address space, with gpasid1 */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 
> > 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > 	  * used, the kernel will not update the PASID table. Instead, just
> > 	  * track the bound I/O page table for handling invalidation and
> > 	  * I/O page faults.
> > 	  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benifit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:56       ` Jason Gunthorpe
@ 2021-06-02  6:57         ` David Gibson
  -1 siblings, 0 replies; 518+ messages in thread
From: David Gibson @ 2021-06-02  6:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 02:56:43PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > > 
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > > >
> > > > 	ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-virtualization
> > > > scenario.
> > > 
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> > 
> > Jason, want to confirm here. Per earlier discussion we remain an
> > impression that you want VFIO to be a pure device driver thus
> > container/group are used only for legacy application.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 
> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

I don't think presence or absence of a group fd makes a lot of
difference to this design.  Having a group fd just means we attach
groups to the ioasid instead of individual devices, and we no longer
need the bookkeeping of "partial" devices.

> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.

Right.  I'd assume that for compatibility, creating a container would
create a single IOASID under the hood with a compatibility layer
translating the container operations to ioasid operations.
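
As a rough sketch of that idea, a legacy VFIO_IOMMU_MAP_DMA on the container
would just be forwarded to the hidden IOASID; the forwarding helper, the
compat fields and ioasid_dma_map() are illustrative names only:

	static int vfio_compat_map_dma(struct vfio_container *container,
				       struct vfio_iommu_type1_dma_map *map)
	{
		struct ioasid_dma_map idm = {
			.ioasid	= container->compat_ioasid,	/* created with the container */
			.iova	= map->iova,
			.vaddr	= map->vaddr,
			.size	= map->size,
		};

		return ioasid_dma_map(container->compat_ctx, &idm);
	}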

> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

Again, I don't think it makes much difference.  The model doesn't
really change even if you allow both ATTACH_GROUP and ATTACH_DEVICE on
the IOASID.  Basically ATTACH_GROUP would just be equivalent to
attaching all the constituent devices.
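
Roughly, with an assumed helper for walking the group's registered devices:

	int ioasid_group_attach(struct ioasid_ctx *ctx, struct iommu_group *group,
				u32 ioasid)
	{
		struct ioasid_dev *idev;
		int ret;

		for_each_ioasid_dev_in_group(ctx, group, idev) {
			struct attach_info info = { .ioasid = ioasid };

			ret = ioasid_device_attach(idev, info);
			if (ret)
				return ret;	/* real code would unwind */
		}
		return 0;
	}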

> I didn't try to chart this out carefully.
> 
> Also, ultimately, something need to be done about compatability with
> the vfio container fd. It looks clear enough to me that the the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite rasonable to harmonize these somehow.
> 
> But that is too complicated and far out for me at least to guess on at
> this point..
> 
> > > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > > there any scenario where we want different vpasid's for the same
> > > IOASID? I guess it is OK like this. Hum.
> > 
> > Yes, it's completely sane that the guest links a I/O page table to 
> > different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> > that when multiple devices share an I/O page table they must use
> > the same PASID#. 
> 
> Ok..
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36   ` Jason Gunthorpe
@ 2021-06-02  7:22     ` David Gibson
  -1 siblings, 0 replies; 518+ messages in thread
From: David Gibson @ 2021-06-02  7:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

[-- Attachment #1: Type: text/plain, Size: 6445 bytes --]

On Fri, May 28, 2021 at 08:36:49PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> > 
> > /*
> >   * Check whether an uAPI extension is supported. 
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration. 
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
>  
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   *	- vaddr;
> >   *	- size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all-SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.

Right, I think we can simplify the interface by modelling the
preregistration as a nesting layer.  Well, mostly.. the wrinkle is
that generally you can't do anything with an ioasid until you've
attached devices to it, but that doesn't really make sense for the
prereg layer.  I expect we can find some way to deal with that,
though.

Actually... to simplify that "weak nesting" concept I wonder if we
want to expand to 3 ways of specifying the pagetables for the ioasid:
  1) kernel managed (MAP/UNMAP)
  2) user managed (BIND/INVALIDATE)
  3) pass-through (IOVA==parent address)

Obviously pass-through wouldn't be allowed in all circumstances.
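
Expressed as uAPI that could be little more than a mode selector at
IOASID_ALLOC time.  The struct below is purely illustrative (field and
flag names invented here), not a concrete proposal:

	/* Hypothetical alloc argument selecting how the I/O page table is managed */
	enum ioasid_pgtable_mode {
		IOASID_PGTABLE_KERNEL,		/* 1) kernel managed: MAP/UNMAP */
		IOASID_PGTABLE_USER,		/* 2) user managed: BIND/INVALIDATE */
		IOASID_PGTABLE_PASSTHROUGH,	/* 3) IOVA == parent address */
	};

	struct ioasid_alloc_args {
		__u32	argsz;
		__u32	mode;			/* enum ioasid_pgtable_mode */
		__u32	parent_ioasid;		/* used for USER/PASSTHROUGH nesting */
		__u32	pad;
	};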

> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID. 
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address 
> >   * space. Each IOASID is associated with a single I/O page table. User 
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is 
> >   * attached to a device. Once attached, an empty I/O page table is 
> >   * bound with the IOMMU then the user could use either DMA mapping 
> >   * or pgtable binding commands to manage this I/O page table.
> 
> Can the IOASID be populated before being attached?

I don't think it reasonably can.  Until attached, you don't actually
know what hardware IOMMU will be backing it, and therefore you don't
know its capabilities.  You can't really allow mappings if you don't
even know the allowed IOVA ranges and page sizes.
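
So the ordering a user would have to follow is roughly the below (a
sketch only; the ioctl names are taken, or lightly adapted, from the
proposal, and the argument structs are assumed):

	ioasid_fd = open("/dev/ioasid", O_RDWR);
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_args);

	/* attach at least one device first, via the device driver uAPI,
	 * so the backing IOMMU and its capabilities become known */
	ioctl(device_fd, VFIO_ATTACH_IOASID, &attach_args);

	/* only now do GET_INFO and MAP/UNMAP make sense */
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map_args);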

> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
> 
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >   *	- vendor pgtable formats (pgtable binding);
> >   *	- number of child IOASIDs (nesting);
> >   *	- ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.
> 
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

Yes... but as above, we have no idea what the IOMMU's capabilities are
until devices are attached.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating an IOASID to cover the guest GPA
> with the intention of making children, it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.
> 
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is trying to use.

[snip]
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
> 
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *  - ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

The group number could be used for that, even if there are no group
fds.  You generally can't identify things more narrowly than group
anyway.
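
For example, a fault or invalidation event could simply carry the group
number as that identifier.  A hypothetical record layout, just to
illustrate:

	/* Hypothetical fault record using the group number as the "device id" */
	struct ioasid_fault_event {
		__u32	ioasid;
		__u32	group_id;	/* iommu group of the faulting device */
		__u64	iova;		/* faulting I/O virtual address */
		__u32	flags;
		__u32	pad;
	};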


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 17:37   ` Parav Pandit
@ 2021-06-02  8:38     ` Enrico Weigelt, metux IT consult
  -1 siblings, 0 replies; 518+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-02  8:38 UTC (permalink / raw)
  To: Parav Pandit, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On 31.05.21 19:37, Parav Pandit wrote:

> It appears that this is only to make the map ioctl faster, apart from accounting.
> It doesn't have any ioasid handle input either.
> 
> In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
> For example, a few years back such a system call, mpin(), was proposed in [1].

I'm very reluctant to more syscall inflation. We already have lots of
syscalls that could have been easily done via devices or filesystems
(yes, some of them are just old Unix relics).

Syscalls don't play well w/ modules, containers, distributed systems,
etc, and need extra low-level code for most non-C languages (eg.
scripting languages).


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 20:28       ` Jason Gunthorpe
@ 2021-06-02  8:52         ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-02  8:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy


On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
>> I summarized five opens here, about:
>>
>> 1)  Finalizing the name to replace /dev/ioasid;
>> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
>> 3)  Carry device information in invalidation/fault reporting uAPI;
>> 4)  What should/could be specified when allocating an IOASID;
>> 5)  The protocol between vfio group and kvm;
>>
>> For 1), two alternative names are mentioned: /dev/iommu and
>> /dev/ioas. I don't have a strong preference and would like to hear
>> votes from all stakeholders. /dev/iommu is slightly better imho for
>> two reasons. First, per AMD's presentation at the last KVM Forum they
>> implement the vIOMMU in hardware and thus need to support user-managed
>> domains. An iommu uAPI notation might make more sense moving
>> forward. Second, it makes later uAPI naming easier as 'IOASID' can
>> always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of
>> IOASID_ALLOC_IOASID. :)
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time.


It looks to me like using "/dev/iommu" excludes the possibility of
implementing IOASID in a device-specific way (e.g. through the
co-operation of a device MMU with the platform IOMMU)?

What's more, the ATS spec doesn't forbid device #PFs from being reported in a
device-specific way.

Thanks


> We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
>
> Then we don't have naming aliasing with kernel constructs.
>   


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:31                         ` Jason Gunthorpe
@ 2021-06-02  8:54                           ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-02  8:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L, kvm, Jonathan Corbet, iommu,
	LKML, Alex Williamson (alex.williamson@redhat.com)"",
	David Woodhouse


On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
>   
>> We can open up to ~0U file descriptors, I don't see why we need to restrict
>> it in uAPI.
> There are significant problems with such large file descriptor
> tables. High FD numbers mean things like select() don't work at all
> anymore, and IIRC there are more complications.


I don't see much difference between IOASID fds and other types of fds. People
can choose to use poll or epoll.

And with the current proposal (assuming an N:1 relationship between IOASIDs
and the ioasid fd), I wonder how select can work for a specific ioasid.

Thanks


>
> A huge number of FDs for typical usages should be avoided.
>
> Jason
>


^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 ` Tian, Kevin
@ 2021-06-02  8:56   ` Enrico Weigelt, metux IT consult
  -1 siblings, 0 replies; 518+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-02  8:56 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On 27.05.21 09:58, Tian, Kevin wrote:

Hi,

> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.

While I'm in favour of having generic APIs for generic tasks, as well as
using FDs, I wonder whether it has to be a new and separate device.

Now applications have to use multiple APIs in lockstep. One consequence
of that is operators, as well as provisioning systems, container
infrastructures, etc, always have to consider multiple devices together.

You can't just say "give workload XY access to device /dev/foo" anymore.
Now you have to take care of scenarios like "if someone wants
/dev/foo, he also needs /dev/bar". And if that happens multiple times
together ("/dev/foo and /dev/wurst both require /dev/bar"), leading to
scenarios where the dev nodes are bind-mounted somewhere, you need to
take care that additional devices aren't bind-mounted twice, etc ...

If I understand this correctly, /dev/ioasid is a kind of "common
supplier" to other APIs / devices. Why can't the fd be acquired by the
consumer APIs (e.g. kvm, vfio, etc.)?


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:29                     ` Jason Gunthorpe
@ 2021-06-02  8:58                       ` Jason Wang
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Wang @ 2021-06-02  8:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L,
	Alex Williamson (alex.williamson@redhat.com)"",
	kvm, Jonathan Corbet, LKML, iommu, David Woodhouse


On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:
>
>> For the case of 1M, I would like to know what's the use case for a single
>> process to handle 1M+ address spaces?
> For some scenarios every guest PASID will require an IOASID # so
> there is a large enough demand that FDs alone are not a good fit.
>
> Further, there are global container-wide properties that are hard to
> carry over to a multi-FD model, like the attachment of devices to the
> container at startup.


So if we implement a per-fd model, the global "container" properties could
be handled via the parent fd, e.g. attaching the parent to the device at
startup.
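
Something like the following, perhaps (a sketch: VFIO_BIND_IOASID_FD is
reused from the proposal, while IOASID_CREATE_CHILD_FD and the argument
structs are made up here just to show the shape of a per-fd model):

	/* parent fd carries the "container"-wide state */
	parent_fd = open("/dev/ioasid", O_RDWR);

	/* device attachment done once, against the parent */
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &parent_fd);

	/* each address space is then its own fd derived from the parent */
	gpa_fd = ioctl(parent_fd, IOASID_CREATE_CHILD_FD, &gpa_args);
	gva_fd = ioctl(parent_fd, IOASID_CREATE_CHILD_FD, &gva_args);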


>
>>> So this RFC treats the fd as a container of address spaces, each of which
>>> is tagged by an IOASID.
>> If the container and address space are 1:1 then the container seems useless.
> The examples at the bottom of the document show multiple IOASIDs in
> the container for a parent/child type relationship


This can also be done per fd? A parent fd can have multiple child fds.

Thanks


>
> Jason
>


^ permalink raw reply	[flat|nested] 518+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  8:38     ` Enrico Weigelt, metux IT consult
@ 2021-06-02 12:41       ` Parav Pandit
  -1 siblings, 0 replies; 518+ messages in thread
From: Parav Pandit @ 2021-06-02 12:41 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult, Tian, Kevin, LKML,
	Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse, iommu,
	kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy


> From: Enrico Weigelt, metux IT consult <lkml@metux.net>
> Sent: Wednesday, June 2, 2021 2:09 PM
> 
> On 31.05.21 19:37, Parav Pandit wrote:
> 
> > It appears that this is only to make the map ioctl faster, apart from accounting.
> > It doesn't have any ioasid handle input either.
> >
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example, a few years back such a system call, mpin(), was proposed
> in [1].
> 
> I'm very reluctant to more syscall inflation. We already have lots of syscalls
> that could have been easily done via devices or filesystems (yes, some of
> them are just old Unix relics).
> 
> Syscalls don't play well w/ modules, containers, distributed systems, etc, and
> need extra low-level code for most non-C languages (eg.
> scripting languages).

Likely, but as per my understanding, this ioctl() is a wrapper around device-agnostic code along the lines of:

 {
	/* account the pages against the mm's pinned-page counter ... */
	atomic64_add(npages, &mm->pinned_vm);
	/* ... and take a long-term pin on the user pages */
	pin_user_pages(vaddr, npages, FOLL_WRITE | FOLL_LONGTERM, pages, NULL);
 }

And the mm must hold the reference to them, so that these pages cannot be munmap()'ed or freed.

And the second reason, I think (I could be wrong), is that the second-level page table for a PASID should be the same as what the process CR3 uses.
Essentially the iommu page table and the mmu page table should point to the same page table entries.
If they are different, then even if the guest CPU has accessed the pages, device access via the IOMMU will result in expensive page faults.
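
For comparison, the kernel's native SVA path expresses exactly that sharing
by binding the process mm to a PASID, roughly as below (the exact signatures
may differ between kernel versions):

	struct iommu_sva *handle;
	u32 pasid;

	/* program the PASID entry to point at current->mm's page tables,
	 * so the CPU and the device share one set of translations */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	pasid = iommu_sva_get_pasid(handle);	/* PASID the device should use */

	/* ... device issues PASID-tagged DMA against the shared tables ... */

	iommu_sva_unbind_device(handle);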

So assuming both CR3 and the PASID table entry point to the same page table, I fail to understand the need for the extra refcount and hence a driver-specific ioctl().
I do not have a strong objection to the ioctl(), but I want to know what it will and will not do.
io_uring has a similar ioctl() doing io_sqe_buffer_register(), which pins the memory.

^ permalink raw reply	[flat|nested] 518+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  2:20         ` Tian, Kevin
@ 2021-06-02 16:01           ` Jason Gunthorpe
  -1 siblings, 0 replies; 518+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok,
	Liu, Yi L, Wu, Hao, Jiang, Dave, Jacob Pan,
	Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy

On Wed, Jun 02, 2021 at 02:20:15AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, June 2, 2021 6:22 AM
> > 
> > On Tue, 1 Jun 2021 07:01:57 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > I summarized five opens here, about:
> > >
> > > 1)  Finalizing the name to replace /dev/ioasid;
> > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > 4)  What should/could be specified when allocating an IOASID;
> > > 5)  The protocol between vfio group and kvm;
> > >
> > ...
> > >
> > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > original purpose of this protocol is not about I/O address space. It's
> > > for KVM to know whether any device is assigned to this VM and then
> > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> > 
> > Right, the original use case was for KVM to determine whether it needs
> > to emulate invlpg, so it needs to be aware when an assigned device is
> 
> invlpg -> wbinvd :)
> 
> > present and be able to test if DMA for that device is cache
> > coherent.

Why does this have such a strong linkage to VFIO, and why not just a 'hey kvm,
emulate wbinvd' flag from qemu?

From a brief look I didn't see any obvious linkage in the arch code, just some
dead code:

$ git grep iommu_noncoherent
arch/x86/include/asm/kvm_host.h:	bool iommu_noncoherent;
$ git grep iommu_domain arch/x86
arch/x86/include/asm/kvm_host.h:        struct iommu_domain *iommu_domain;

Huh?

It kind of looks like the other main point is to generate the
VFIO_GROUP_NOTIFY_SET_KVM notifier event, which is being used by two VFIO
drivers to connect back to the kvm data.

But that seems like it would have been better handled with some IOCTL
on the vfio_device fd to import the KVM to the driver, rather than this
roundabout way?
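
Roughly the shape below, completely hypothetical, just to illustrate what a
per-device ioctl instead of the group notifier could look like:

	/* Hypothetical VFIO_DEVICE_SET_KVM: userspace hands the KVM VM fd to the
	 * vfio_device fd, which resolves and holds the kvm reference itself,
	 * instead of receiving it via VFIO_GROUP_NOTIFY_SET_KVM. */
	struct vfio_device_set_kvm {
		__u32	argsz;
		__u32	flags;
		__s32	kvm_fd;		/* -1 to clear */
	};
	#define VFIO_DEVICE_SET_KVM	_IO(VFIO_TYPE, VFIO_BASE + 24)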

> > The user, QEMU, creates a KVM "pseudo" device representing the vfio
> > group, providing the file descriptor of that group to show ownership.
> > The ugly symbol_get code is to avoid hard module dependencies, ie. the
> > kvm module should not pull in or require the vfio module, but vfio will
> > be present if attempting to register this device.
> 
> so the symbol_get thing is not about the protocol itself. Whatever protocol
> is defined, as long as kvm needs to call vfio or ioasid helper functions, we
> need to define a proper way to do it. Jason, what's your opinion on an
> alternative option, since you dislike symbol_get?

The symbol_get was to avoid module dependencies because bringing in
vfio along with kvm is not nice.

The symbol_get is not nice here, but unless things can be truly moved
to lower levels of code where module dependencies are not a problem (e.g.
kvm to iommu is a non-issue) I don't see much of a solution.

Other cases like kvmgt or AP would similarly have been fine with a
kvmgt-to-kvm module dependency.

> > All of these use cases are related to the IOMMU, whether DMA is
>