* [RFC] /dev/ioasid uAPI proposal
@ 2021-05-27  7:58 Tian, Kevin
From: Tian, Kevin @ 2021-05-27  7:58 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Tian, Kevin, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

/dev/ioasid provides a unified interface for managing I/O page tables for 
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
etc.) are expected to use this interface instead of creating their own logic to 
isolate untrusted device DMAs initiated by userspace. 

This proposal describes the uAPI of /dev/ioasid and also sample sequences 
with VFIO as an example of typical usage. The driver-facing kernel API provided 
by the iommu layer is still TBD and can be discussed after consensus is 
reached on this uAPI.

It's based on a lengthy discussion starting from here:
	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 

It ends up being a long write-up due to the many things to be summarized
and the non-trivial effort required to connect them into a complete proposal.
Hopefully it provides a clean base to converge on.

TOC
====
1. Terminologies and Concepts
2. uAPI Proposal
    2.1. /dev/ioasid uAPI
    2.2. /dev/vfio uAPI
    2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
    5.1. A simple example
    5.2. Multiple IOASIDs (no nesting)
    5.3. IOASID nesting (software)
    5.4. IOASID nesting (hardware)
    5.5. Guest SVA (vSVA)
    5.6. I/O page fault
    5.7. BIND_PASID_TABLE
====

1. Terminologies and Concepts
-----------------------------------------

IOASID FD is the container holding multiple I/O address spaces. User 
manages those address spaces through FD operations. Multiple FD's are 
allowed per process, but with this proposal one FD should be sufficient for 
all intended usages.

IOASID is the FD-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying the output address from one I/O page table 
(represented by child IOASID) must be further translated by another I/O 
page table (represented by parent IOASID).

I/O address space can be managed through two protocols, according to 
whether the corresponding I/O page table is constructed by the kernel or 
the user. When kernel-managed, a dma mapping protocol (similar to 
existing VFIO iommu type1) is provided for the user to explicitly specify 
how the I/O address space is mapped. Otherwise, a different protocol is 
provided for the user to bind a user-managed I/O page table to the 
IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
handling. 

Pgtable binding protocol can be used only on the child IOASID's, implying 
IOASID nesting must be enabled. This is because the kernel doesn't trust 
userspace. Nesting allows the kernel to enforce its DMA isolation policy 
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and 
software nesting. With hardware support the child and parent I/O page 
tables are walked consecutively by the IOMMU to form a nested translation. 
When it's implemented in software, the ioasid driver is responsible for 
merging the two-level mappings into a single-level shadow I/O page table. 
Software nesting requires both child/parent page tables to be operated 
through the dma mapping protocol, so any change at either level can be 
captured by the kernel to update the corresponding shadow mapping.

An I/O address space takes effect in the IOMMU only after it is attached 
to a device. The device in the /dev/ioasid context always refers to a 
physical one or 'pdev' (PF or VF). 

One I/O address space could be attached to multiple devices. In this case, 
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed 
to attach to multiple I/O address spaces, with DMAs accessing them by 
carrying different routing information. One of them is the default I/O 
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
remaining are routed by RID + Process Address Space ID (PASID) or 
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through the passthrough framework uAPI 
(VFIO is used for simplicity in the following context). VFIO is responsible 
for identifying the routing information and registering it to the ioasid 
driver when calling the ioasid attach helper function. It could be RID if the 
assigned device is a pdev (PF/VF) or RID+PASID if the device is mediated 
(mdev). In addition, the user might also provide its own view of the virtual 
routing information (vPASID) in
the attach call, e.g. when multiple user-managed I/O address spaces are 
attached to the vfio_device. In this case VFIO must figure out whether 
vPASID should be directly used (for pdev) or converted to a kernel-
allocated one (pPASID, for mdev) for physical routing (see section 4).

A device must be bound to an IOASID FD before the attach operation can be
conducted. This is also done through the VFIO uAPI. In this proposal one 
device should not be bound to multiple FD's. We are not sure about the gain 
of allowing it beyond adding unnecessary complexity, but if others have a 
different view we can discuss further.

VFIO must ensure its device composes DMAs with the routing information
attached to the IOASID. For pdev it naturally happens since vPASID is 
directly programmed to the device by guest software. For mdev this 
implies any guest operation carrying a vPASID on this device must be 
trapped into VFIO and then converted to a pPASID before being sent to 
the device. A detailed explanation of the PASID virtualization policies can 
be found in section 4. 

Modern devices may support a scalable workload submission interface 
based on PCI DMWr capability, allowing a single work queue to access
multiple I/O address spaces. One example is Intel ENQCMD, which has the 
PASID saved in a CPU MSR and carried in the instruction payload when 
sent out to the device. Then a single work queue shared by 
multiple processes can compose DMAs carrying different PASIDs. 

When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
which, if targeting a mdev, must be converted to a pPASID before being 
sent to the wire. Intel CPUs provide a hardware PASID translation capability 
for auto-conversion in the fast path. The user is expected to set up the 
PASID mapping through the KVM uAPI, with information about {vpasid, 
ioasid_fd, ioasid}. The ioasid driver provides a helper function for KVM 
to figure out the actual pPASID given an IOASID.

With the above design the /dev/ioasid uAPI is all about I/O address spaces. 
It doesn't include any device routing information, which is only 
indirectly registered to the ioasid driver through VFIO uAPI. For 
example, I/O page fault is always reported to userspace per IOASID, 
although it's physically reported per device (RID+PASID). If there is a 
need to further relay this fault into the guest, the user is responsible for 
identifying the device attached to this IOASID (randomly picking one if 
multiple devices are attached) and then generating a per-device virtual I/O 
page fault into the guest. Similarly the iotlb invalidation uAPI describes the 
granularity in the I/O address space (all, or a range), different from the 
underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

I/O page tables routed through PASID are installed in a per-RID PASID 
table structure. Some platforms implement the PASID table in the guest 
physical (GPA) space, expecting it to be managed by the guest. The guest
PASID table is bound to the IOMMU also by attaching to an IOASID, 
representing the per-RID vPASID space. 

We propose that the host kernel explicitly track guest I/O page 
tables even on these platforms, i.e. the same pgtable binding protocol 
should be used universally on all platforms (with the only difference being 
who actually writes the PASID table). One opinion from the previous 
discussion was to treat this special IOASID as a container for all guest I/O 
page tables, i.e. hiding them from the host. However this significantly 
violates the philosophy of this /dev/ioasid proposal. It would no longer be 
one IOASID, one address space. Device routing information (indirectly 
marking hidden I/O spaces) would have to be carried in the iotlb invalidation 
and page faulting uAPI to help connect the vIOMMU with the underlying 
pIOMMU. This is one design choice to be confirmed with the ARM guys.

Devices may sit behind IOMMUs with incompatible capabilities. The
difference may lie in the I/O page table format, or the availability of a 
user-visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
checking compatibility between a newly-attached device and the existing
devices under the specific IOASID and, if incompatible, returning an error 
to the user. Upon such an error the user should create a new IOASID for 
the incompatible device, as sketched below.
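A minimal userspace sketch of that fallback, in the style of the flows in 
section 5 (device_fd3, incompat_ioasid and the -EINVAL error code are 
assumptions for illustration only):

	/* hypothetical fallback: attach fails due to an incompatible IOMMU */
	at_data = { .ioasid = gpa_ioasid };
	if (ioctl(device_fd3, VFIO_ATTACH_IOASID, &at_data) < 0 &&
	    errno == EINVAL) {
		/* create a separate IOASID for the incompatible device */
		incompat_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
		at_data.ioasid = incompat_ioasid;
		ioctl(device_fd3, VFIO_ATTACH_IOASID, &at_data);
	}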

There is no explicit group enforcement in the /dev/ioasid uAPI, because 
there is no device notation in this interface as aforementioned. But the 
ioasid driver does an implicit check to make sure that devices within an 
iommu group are all attached to the same IOASID before this IOASID starts 
to accept any uAPI command. Otherwise an error is returned to 
the user.

There was a long debate in the previous discussion about whether VFIO 
should keep explicit container/group semantics in its uAPI. Jason Gunthorpe 
proposes a simplified model where every device bound to VFIO is explicitly 
listed under /dev/vfio, thus a device fd can be acquired w/o going through 
the legacy container/group interface. In this case the user is responsible for 
understanding the group topology and meeting the implicit group check 
criteria enforced in /dev/ioasid. The use case examples in this proposal 
are based on the new model.

Of course for backward compatibility VFIO still needs to keep the existing 
uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
iommu ops to internal ioasid helper functions.

Notes:
-   It might be confusing as IOASID is also used in the kernel (drivers/
    iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
    to find a better name later to differentiate them.

-   PPC has not been considered yet as we haven't had time to fully understand
    its semantics. According to previous discussion there is some commonality 
    between the PPC window-based scheme and VFIO type1 semantics. Let's 
    first reach consensus on this proposal and then further discuss how to 
    extend it to cover PPC's requirements.

-   There is a protocol between vfio group and kvm. We need to think about
    how it will be affected by this proposal.

-   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
    which can be physically isolated from each other through PASID-granular
    IOMMU protection. Historically people also discussed one usage of 
    mediating a pdev into a mdev. This usage is not covered here, and is 
    supposed to be replaced by Max's work which allows overriding various 
    VFIO operations in the vfio-pci driver.

2. uAPI Proposal
----------------------

/dev/ioasid uAPI covers everything about managing I/O address spaces.

/dev/vfio uAPI builds connection between devices and I/O address spaces.

/dev/kvm uAPI is optionally required, only as far as ENQCMD is concerned.


2.1. /dev/ioasid uAPI
+++++++++++++++++

/*
  * Check whether a uAPI extension is supported. 
  *
  * This is for FD-level capabilities, such as locked page pre-registration. 
  * IOASID-level capabilities are reported through IOASID_GET_INFO.
  *
  * Return: 0 if not supported, 1 if supported.
  */
#define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
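
A minimal usage sketch (IOASID_EXT_MEM_PREREG below is an assumed,
purely illustrative extension ID; no extension IDs are defined in this
proposal yet):

	ioasid_fd = open("/dev/ioasid", mode);
	/* returns 1 if locked page pre-registration is supported */
	prereg = ioctl(ioasid_fd, IOASID_CHECK_EXTENSION,
			IOASID_EXT_MEM_PREREG);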


/*
  * Register user space memory where DMA is allowed.
  *
  * It pins user pages and does the locked memory accounting so
  * subsequent IOASID_MAP/UNMAP_DMA calls get faster.
  *
  * When this ioctl is not used, one user page might be accounted
  * multiple times when it is mapped by multiple IOASIDs which are
  * not nested together.
  *
  * Input parameters:
  *	- vaddr;
  *	- size;
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
#define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)


/*
  * Allocate an IOASID. 
  *
  * IOASID is the FD-local software handle representing an I/O address 
  * space. Each IOASID is associated with a single I/O page table. User 
  * must call this ioctl to get an IOASID for every I/O address space that is
  * intended to be enabled in the IOMMU.
  *
  * A newly-created IOASID doesn't accept any command before it is 
  * attached to a device. Once attached, an empty I/O page table is 
  * bound to the IOMMU, and the user can then use either DMA mapping 
  * or pgtable binding commands to manage this I/O page table.
  *
  * Device attachment is initiated through device driver uAPI (e.g. VFIO)
  *
  * Return: allocated ioasid on success, -errno on failure.
  */
#define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
#define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)


/*
  * Get information about an I/O address space
  *
  * Supported capabilities:
  *	- VFIO type1 map/unmap;
  *	- pgtable/pasid_table binding
  *	- hardware nesting vs. software nesting;
  *	- ...
  *
  * Related attributes:
  * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
  *	- vendor pgtable formats (pgtable binding);
  *	- number of child IOASIDs (nesting);
  *	- ...
  *
  * The above information is available only after one or more devices are
  * attached to the specified IOASID. Otherwise the IOASID is just a
  * number w/o any capability or attribute.
  *
  * Input parameters:
  *	- u32 ioasid;
  *
  * Output parameters:
  *	- many. TBD.
  */
#define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
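
The output layout is TBD. Purely as a strawman, the capabilities and
attributes listed above could be packed into something like the struct
below; every field and flag name here is hypothetical and only meant to
illustrate the kind of information reported:

/* hypothetical output layout for IOASID_GET_INFO */
struct ioasid_info {
	__u32	argsz;
	__u32	flags;
#define IOASID_INFO_CAP_MAP_DMA		(1 << 0) /* type1 map/unmap */
#define IOASID_INFO_CAP_BIND_PGTABLE		(1 << 1) /* pgtable binding */
#define IOASID_INFO_CAP_BIND_PASID_TABLE	(1 << 2) /* pasid table binding */
#define IOASID_INFO_CAP_NESTING_HW		(1 << 3) /* hardware nesting */
#define IOASID_INFO_CAP_NESTING_SW		(1 << 4) /* software nesting */
	__u64	iova_pgsizes;		/* supported page size bitmap */
	__u32	pgtable_formats;	/* vendor pgtable format bitmap */
	__u32	nr_child_ioasids;	/* number of child IOASIDs */
	/* reserved IOVA ranges etc. would follow */
};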


/*
  * Map/unmap process virtual addresses to I/O virtual addresses.
  *
  * Provide VFIO type1 equivalent semantics. Start with the same 
  * restriction e.g. the unmap size should match those used in the 
  * original mapping call. 
  *
  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
  * must be already in the preregistered list.
  *
  * Input parameters:
  *	- u32 ioasid;
  *	- refer to vfio_iommu_type1_dma_{un}map
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
#define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)


/*
  * Create a nesting IOASID (child) on an existing IOASID (parent)
  *
  * IOASIDs can be nested together, implying that the output address 
  * from one I/O page table (child) must be further translated by 
  * another I/O page table (parent).
  *
  * As the child adds essentially another reference to the I/O page table 
  * represented by the parent, any device attached to the child ioasid 
  * must be already attached to the parent.
  *
  * Conceptually there is no limit on the number of nesting levels. 
  * However for the majority of cases one nesting level is sufficient. The
  * user should check whether an IOASID supports nesting through 
  * IOASID_GET_INFO. For example, if only one nesting level is allowed,
  * the nesting capability is reported only on the parent instead of the
  * child.
  *
  * The user also needs to check (via IOASID_GET_INFO) whether the nesting 
  * is implemented in hardware or software. If software-based, DMA 
  * mapping protocol should be used on the child IOASID. Otherwise, 
  * the child should be operated with pgtable binding protocol.
  *
  * Input parameters:
  *	- u32 parent_ioasid;
  *
  * Return: child_ioasid on success, -errno on failure;
  */
#define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)


/*
  * Bind a user-managed I/O page table to the IOMMU
  *
  * Because the user page table is untrusted, IOASID nesting must be enabled 
  * for this ioasid so the kernel can enforce its DMA isolation policy 
  * through the parent ioasid.
  *
  * Pgtable binding protocol is different from DMA mapping. The latter 
  * has the I/O page table constructed by the kernel and updated 
  * according to user MAP/UNMAP commands. With pgtable binding the 
  * whole page table is created and updated by userspace, thus a different 
  * set of commands is required (bind, iotlb invalidation, page fault, etc.).
  *
  * Because the page table is directly walked by the IOMMU, the user 
  * must use a format compatible with the underlying hardware. It can 
  * check the format information through IOASID_GET_INFO.
  *
  * The page table is bound to the IOMMU according to the routing 
  * information of each attached device under the specified IOASID. The
  * routing information (RID and optional PASID) is registered when a 
  * device is attached to this IOASID through VFIO uAPI. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of the user page table;
  *	- formats (vendor, address_width, etc.);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
#define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)


/*
  * Bind a user-managed PASID table to the IOMMU
  *
  * This is required for platforms which place the PASID table in the GPA space.
  * In this case the specified IOASID represents the per-RID PASID space.
  *
  * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
  * special flag to indicate the difference from normal I/O address spaces.
  *
  * The format info of the PASID table is reported in IOASID_GET_INFO.
  *
  * As explained in the design section, user-managed I/O page tables must
  * be explicitly bound to the kernel even on these platforms. It allows
  * the kernel to uniformly manage I/O address spaces across all platforms.
  * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
  * to carry device routing information to indirectly mark the hidden I/O
  * address spaces.
  *
  * Input parameters:
  *	- child_ioasid;
  *	- address of PASID table;
  *	- formats (vendor, size, etc.);
  *
  * Return: 0 on success, -errno on failure.
  */
#define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
#define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)


/*
  * Invalidate the IOTLB for a user-managed I/O page table
  *
  * Unlike what's defined in include/uapi/linux/iommu.h, this command 
  * doesn't allow the user to specify the cache type and likely supports only
  * two granularities (all, or a specified range) in the I/O address space.
  *
  * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
  * cache). If the IOASID represents an I/O address space, the invalidation
  * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
  * represents a vPASID space, then this command applies to the PASID
  * cache.
  *
  * Similarly this command doesn't provide IOMMU-like granularity
  * info (domain-wide, pasid-wide, range-based), since it's all about the
  * I/O address space itself. The ioasid driver walks the attached
  * routing information to match the IOMMU semantics under the
  * hood. 
  *
  * Input parameters:
  *	- child_ioasid;
  *	- granularity
  * 
  * Return: 0 on success, -errno on failure
  */
#define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)


/*
  * Page fault report and response
  *
  * This is TBD. Can be added after other parts are cleared up. Likely it 
  * will be a ring buffer shared between user/kernel, an eventfd to notify 
  * the user and an ioctl to complete the fault.
  *
  * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
  */
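
As a strawman only (again, the uAPI here is explicitly TBD), the per-IOASID
fault record placed in the shared ring buffer and the completion payload
might look like the sketch below; all names are hypothetical:

/* hypothetical fault record placed in the shared ring buffer */
struct ioasid_fault_data {
	__u32	ioasid;		/* the faulting I/O address space */
	__u32	flags;
	__u64	addr;		/* the faulting I/O address */
};

/* hypothetical response written back by the user to complete a fault */
struct ioasid_fault_response {
	__u32	ioasid;
	__u32	response_code;	/* success / invalid / failure */
	__u64	addr;
};

/* #define IOASID_COMPLETE_FAULT	_IO(IOASID_TYPE, IOASID_BASE + 14) */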


/*
  * Dirty page tracking 
  *
  * Track and report memory pages dirtied in I/O address spaces. There 
  * is ongoing work by Kunkun Jiang extending the existing VFIO type1 
  * support. It needs to be adapted to /dev/ioasid later.
  */


2.2. /dev/vfio uAPI
++++++++++++++++

/*
  * Bind a vfio_device to the specified IOASID fd
  *
  * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
  * vfio device should not be bound to multiple ioasid_fd's. 
  *
  * Input parameters:
  *	- ioasid_fd;
  *
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
#define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)


/*
  * Attach a vfio device to the specified IOASID
  *
  * Multiple vfio devices can be attached to the same IOASID, and vice 
  * versa. 
  *
  * User may optionally provide a "virtual PASID" to mark an I/O page 
  * table on this vfio device. Whether the virtual PASID is physically used 
  * or converted to another kernel-allocated PASID is a policy decision in the 
  * vfio device driver.
  *
  * There is no need to specify ioasid_fd in this call due to the assumption 
  * of 1:1 connection between vfio device and the bound fd.
  *
  * Input parameter:
  *	- ioasid;
  *	- flag;
  *	- user_pasid (if specified);
  * 
  * Return: 0 on success, -errno on failure.
  */
#define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
#define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)


2.3. KVM uAPI
++++++++++++

/*
  * Update CPU PASID mapping
  *
  * This is necessary when ENQCMD will be used in the guest while the
  * targeted device doesn't accept the vPASID saved in the CPU MSR.
  *
  * This command allows the user to set/clear the vPASID->pPASID mapping
  * in the CPU, by providing the IOASID (and FD) information representing
  * the I/O address space marked by this vPASID.
  *
  * Input parameters:
  *	- user_pasid;
  *	- ioasid_fd;
  *	- ioasid;
  */
#define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)


3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOASID_FD:

	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
	int ioasid_unregister_device(struct ioasid_dev *dev);

An ioasid_ctx is created for each fd:

	struct ioasid_ctx {
		// a list of allocated IOASID data's
		struct list_head		ioasid_list;
		// a list of registered devices
		struct list_head		dev_list;
		// a list of pre-registered virtual address ranges
		struct list_head		prereg_list;
	};

Each registered device is represented by ioasid_dev:

	struct ioasid_dev {
		struct list_head		next;
		struct ioasid_ctx	*ctx;
		// always be the physical device
		struct device 		*device;
		struct kref		kref;
	};

Because we assume one vfio_device is connected to at most one ioasid_fd, 
ioasid_dev could be embedded in vfio_device and then linked to 
ioasid_ctx->dev_list when registration succeeds. For mdev the struct
device should point to the parent device. The PASID marking this
mdev is specified later via VFIO_ATTACH_IOASID, as sketched below.
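
A rough sketch of that embedding and of the VFIO_BIND_IOASID_FD wiring;
the vfio_device layout and the function name are hypothetical and only
illustrate how the three helpers above could be used:

	struct vfio_device {
		...
		// embedded, 1:1 with the bound ioasid_fd
		struct ioasid_dev	idev;
	};

	/*
	 * 'phys_dev' is the physical device for a pdev, or the parent
	 * device for a mdev (how it is resolved is driver-specific).
	 */
	int vfio_device_bind_ioasid_fd(struct vfio_device *vdev,
				       struct device *phys_dev, int ioasid_fd)
	{
		struct ioasid_ctx *ctx = ioasid_ctx_fdget(ioasid_fd);

		if (!ctx)
			return -EBADF;
		vdev->idev.device = phys_dev;
		return ioasid_register_device(ctx, &vdev->idev);
	}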

An ioasid_data is created by IOASID_ALLOC, as the main object 
describing the characteristics of an I/O page table:

	struct ioasid_data {
		// link to ioasid_ctx->ioasid_list
		struct list_head		next;

		// the IOASID number
		u32			ioasid;

		// the handle to convey iommu operations
		// hold the pgd (TBD until discussing iommu api)
		struct iommu_domain *domain;

		// map metadata (vfio type1 semantics)
		struct rb_node		dma_list;

		// pointer to user-managed pgtable (for nesting case)
		u64			user_pgd;

		// link to the parent ioasid (for nesting)
		struct ioasid_data	*parent;

		// cache the global PASID shared by ENQCMD-capable
		// devices (see below explanation in section 4)
		u32			pasid;

		// a list of device attach data (routing information)
		struct list_head		attach_data;

		// a list of partially-attached devices (group)
		struct list_head		partial_devices;

		// a list of fault_data reported from the iommu layer
		struct list_head		fault_data;

		...
	}

ioasid_data and iommu_domain have overlapping roles as both are 
introduced to represent an I/O address space. It is still a big TBD how 
the two should be correlated or even merged, and whether new iommu 
ops are required to handle RID+PASID explicitly. We leave this open 
for now as this proposal is mainly about uAPI. For simplicity the two 
objects are kept separate in this context, assuming a 1:1 connection 
between them, with the domain as the placeholder representing the 
first-class object in the iommu ops. 

Two helper functions are provided to support VFIO_ATTACH_IOASID:

	struct attach_info {
		u32	ioasid;
		// If valid, the PASID to be used physically
		u32	pasid;
	};
	int ioasid_device_attach(struct ioasid_dev *dev, 
		struct attach_info info);
	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

The pasid parameter is optionally provided based on the policy in the vfio
device driver. It could be the PASID marking the default I/O address 
space for a mdev, the user-provided PASID marking a user I/O page
table, or another kernel-allocated PASID backing the user-provided one.
Please check the next section for a detailed explanation.

A new object is introduced and linked to ioasid_data->attach_data for 
each successful attach operation:

	struct ioasid_attach_data {
		struct list_head		next;
		struct ioasid_dev	*dev;
		u32 			pasid;
	}

As explained in the design section, there is no explicit group enforcement
in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
implicit group check: until every device within an iommu group is 
attached to this IOASID, the already-attached devices in this group are
put in ioasid_data->partial_devices. The IOASID rejects any command while
the partial_devices list is not empty, as sketched below.
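
A possible sketch of that implicit check inside the ioasid driver, using
only the sample structures above plus the existing iommu_group iteration
helpers (the function names below are made up for illustration):

	/* returns 0 if 'dev' is attached to this IOASID, 1 otherwise */
	static int ioasid_dev_is_attached(struct device *dev, void *data)
	{
		struct ioasid_data *ioasid = data;
		struct ioasid_attach_data *ad;

		list_for_each_entry(ad, &ioasid->attach_data, next)
			if (ad->dev->device == dev)
				return 0;
		return 1;
	}

	/* uAPI commands are rejected until this returns true */
	static bool ioasid_group_complete(struct ioasid_data *ioasid,
					  struct device *dev)
	{
		struct iommu_group *group = iommu_group_get(dev);
		int missing;

		missing = iommu_group_for_each_dev(group, ioasid,
						   ioasid_dev_is_attached);
		iommu_group_put(group);
		return missing == 0;
	}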

The last helper function is:
	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
		u32 ioasid, bool alloc);

ioasid_get_global_pasid is necessary in scenarios where multiple devices 
want to share the same PASID value on the attached I/O page table (e.g. 
when ENQCMD is enabled, as explained in the next section). We need a 
centralized place (ioasid_data->pasid) to hold this value (allocated when
first called with alloc=true). The vfio device driver calls this function (alloc=
true) to get the global PASID for an ioasid before calling ioasid_device_
attach. KVM also calls this function (alloc=false) to set up the PASID 
translation structure when the user calls KVM_MAP_PASID.
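
A short sketch of how an ENQCMD-capable mdev driver might combine these
helpers (the surrounding driver function is assumed, not part of the
proposal):

	static int mdev_attach_gva_ioasid(struct ioasid_dev *idev, u32 ioasid)
	{
		struct attach_info info = {
			.ioasid	= ioasid,
			// allocate (or look up) the PASID shared by all
			// ENQCMD-capable mdevs attached to this IOASID
			.pasid	= ioasid_get_global_pasid(idev->ctx,
							  ioasid, true),
		};

		return ioasid_device_attach(idev, info);
	}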

4. PASID Virtualization
------------------------------

When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
created on the assigned vfio device. This leads to the concepts of 
"virtual PASID" (vPASID) vs. "physical PASID" (pPASID). A vPASID is assigned 
by the guest to mark a GVA address space while a pPASID is the one 
selected by the host and actually routed on the wire.

vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

The vfio device driver translates vPASID to pPASID before calling 
ioasid_device_attach, with two factors to be considered:

-    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
     should be instead converted to a newly-allocated one (vPASID!=
     pPASID);

-    If vPASID!=pPASID, whether pPASID is allocated from the per-RID PASID
     space or a global PASID space (implying pPASID is shared across devices,
     e.g. when supporting Intel ENQCMD which puts the PASID in a CPU MSR
     as part of the process context);

The actual policy depends on pdev vs. mdev, and whether ENQCMD is
supported. There are three possible scenarios:

(Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
policies.)

1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID

     vPASIDs are directly programmed by the guest to the assigned MMIO 
     bar, implying all DMAs out of this device carry a vPASID in the packet 
     header. This mandates vPASID==pPASID, sort of delegating the entire 
     per-RID PASID space to the guest.

     When ENQCMD is enabled, the CPU MSR when running a guest task
     contains a vPASID. In this case the CPU PASID translation capability 
     should be disabled so this vPASID in CPU MSR is directly sent to the
     wire.

     This ensures consistent vPASID usage on a pdev regardless of whether 
     the workload is submitted through an MMIO register or the ENQCMD 
     instruction.

2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)

     PASIDs are also used by the kernel to mark the default I/O address space 
     for a mdev, thus cannot be delegated to the guest. Instead, the mdev 
     driver must allocate a new pPASID for each vPASID (thus vPASID!=
     pPASID) and then use the pPASID when attaching this mdev to an ioasid.

     The mdev driver needs to cache the PASID mapping so that in the 
     mediation path a vPASID programmed by the guest can be converted to 
     the pPASID before updating the physical MMIO register. The mapping 
     should also be saved in the CPU PASID translation structure (via the KVM 
     uAPI), so that when ENQCMD is enabled the vPASID saved in the CPU 
     MSR is auto-translated to the pPASID before being sent to the wire. 

     Generally pPASID could be allocated from the per-RID PASID space
     if all mdev's created on the parent device don't support ENQCMD.

     However if the parent supports ENQCMD-capable mdevs, pPASIDs
     must be allocated from a global pool because the CPU PASID 
     translation structure is per-VM. It implies that when a guest I/O 
     page table is attached to two mdevs with a single vPASID (i.e. bound 
     to the same guest process), the same pPASID should be used for 
     both mdevs even when they belong to different parents. Sharing a
     pPASID across mdevs is achieved by calling the aforementioned 
     ioasid_get_global_pasid().

3)  Mix pdev/mdev together

     The above policies are per device type and thus are not affected when 
     mixing device types together (when assigned to a single guest). However, 
     there is one exception - when both pdev and mdev support ENQCMD.

     Remember the two types have conflicting requirements on whether 
     CPU PASID translation should be enabled. This capability is per-VM, 
     and must be enabled for mdev isolation. When enabled, a pdev would 
     receive a mdev pPASID, violating its vPASID expectation.

     In the previous thread a PASID range split scheme was discussed to 
     support this combination, but we haven't worked out a clean uAPI design 
     yet. Therefore in this proposal we decide not to support it, implying the 
     user should be smart enough to avoid such a scenario. It could be a 
     TODO task for the future.

In spite of those subtle considerations, the kernel implementation could
start simple, e.g.:

-    v==p for pdev;
-    v!=p and always use a global PASID pool for all mdev's;

Regardless of the kernel policy, the user policy is unchanged:

-    provide vPASID when calling VFIO_ATTACH_IOASID;
-    call KVM uAPI to set up CPU PASID translation for an ENQCMD-capable mdev;
-    don't expose the ENQCMD capability on both pdev and mdev;

Sample user flow is described in section 5.5.
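
Below is a rough sketch of that simple starting kernel policy inside the
vfio device driver; the is_pdev flag and the function name are hypothetical:

	/* hypothetical vPASID -> pPASID selection for the simple policy */
	static u32 vfio_select_ppasid(struct vfio_device *vdev,
				      struct ioasid_dev *idev,
				      u32 ioasid, u32 vpasid)
	{
		if (vdev->is_pdev)
			return vpasid;	/* v==p, guest owns the PASID space */

		/* mdev: always back the vPASID with a global pPASID */
		return ioasid_get_global_pasid(idev->ctx, ioasid, true);
	}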

5. Use Cases and Flows
-------------------------------

Here we assume VFIO will support a new model where every bound device
is explicitly listed under /dev/vfio, thus a device fd can be acquired w/o 
going through the legacy container/group interface. For illustration purposes
those devices are just called dev[1...N]:

	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

As explained earlier, one IOASID fd is sufficient for all intended use cases:

	ioasid_fd = open("/dev/ioasid", mode);

For simplicity the examples below are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.

Three types of IOASIDs are considered:

	gpa_ioasid[1...N]: 	for GPA address space
	giova_ioasid[1...N]:	for guest IOVA address space
	gva_ioasid[1...N]:	for guest CPU VA address space

At least one gpa_ioasid must always be created per guest, while the other 
two are relevant only as far as vIOMMU is concerned.

Examples here apply to both pdev and mdev unless explicitly noted
otherwise (e.g. in section 5.5). The VFIO device driver in the kernel will 
figure out the associated routing information in the attach operation.

For simplicity of illustration, IOASID_CHECK_EXTENSION and 
IOASID_GET_INFO are skipped in these examples.

5.1. A simple example
++++++++++++++++++

Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
space is managed through DMA mapping protocol:

	/* Bind device to IOASID fd */
	device_fd = open("/dev/vfio/devices/dev1", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* Attach device to IOASID */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

If the guest is assigned more devices than dev1, the user follows the above
sequence to attach the other devices to the same gpa_ioasid, i.e. sharing 
the GPA address space across all assigned devices.

5.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates 
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
through mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today).

To avoid duplicated locked page accounting, it's recommended to pre-
register the virtual address range that will be used for DMA:

	device_fd1 = open("/dev/vfio/devices/dev1", mode);
	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	ioasid_fd = open("/dev/ioasid", mode);
	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* pre-register the virtual address range for accounting */
	mem_info = { .vaddr = 0x40000000; .size = 1GB };
	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);

	/* Attach dev1 and dev2 to gpa_ioasid */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0; 		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

	/* After boot, guest enables a GIOVA space for dev2 */
	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	/* First detach dev2 from previous address space */
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

	/* Then attach dev2 to the new address space */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a shadow DMA mapping according to vIOMMU
	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000; 	// GIOVA
		.vaddr	= 0x40001000;	// HVA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.3. IOASID nesting (software)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with software-based IOASID nesting 
available. In this mode it is the kernel, instead of the user, that creates
the shadow mapping.

The flow before the guest boots is the same as in 5.2, except for one point. 
Because giova_ioasid is nested on gpa_ioasid, locked page accounting is 
only conducted for gpa_ioasid. So it's not necessary to pre-register virtual 
memory.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

	/* After boot */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
	  * to form a shadow mapping.
	  */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x1000;	// GPA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.4. IOASID nesting (hardware)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with hardware-based IOASID nesting
available. In this mode the pgtable binding protocol is used to 
bind the guest IOVA page table with the IOMMU:

	/* After boot */
	/* Make GIOVA space nested on GPA space */
	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev2 to the new address space (child)
	  * Note dev2 is still attached to gpa_ioasid (parent)
	  */
	at_data = { .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= giova_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular information
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

	/* See 5.6 for I/O page fault handling */
	
5.5. Guest SVA (vSVA)
++++++++++++++++++

After boot the guest further creates a GVA address space (gpasid1) on 
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, the user should avoid exposing ENQCMD on both
a pdev and a mdev.

The sequence applies to all device types (pdev or mdev), except for one
additional step of calling KVM for an ENQCMD-capable mdev:

	/* After boot */
	/* Make GVA space nested on GPA space */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space and specify vPASID */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
	  * translation structure through KVM
	  */
	pa_data = {
		.ioasid_fd	= ioasid_fd;
		.ioasid		= gva_ioasid;
		.guest_pasid	= gpasid1;
	};
	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...


5.6. I/O page fault
+++++++++++++++

(The uAPI is TBD. This is just the high-level flow from the host IOMMU driver
to the guest IOMMU driver and back.)

-   Host IOMMU driver receives a page request with raw fault_data {rid, 
    pasid, addr};

-   Host IOMMU driver identifies the faulting I/O page table according to
    information registered by IOASID fault handler;

-   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
    is saved in ioasid_data->fault_data (used for response);

-   IOASID fault handler generates a user fault_data (ioasid, addr), links it 
    to the shared ring buffer and triggers eventfd to userspace;

-   Upon receiving the event, Qemu needs to find the virtual routing information 
    (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
    multiple, pick a random one. This should be fine since the purpose is to
    fix the I/O page table on the guest;

-   Qemu generates a virtual I/O page fault through vIOMMU into guest,
    carrying the virtual fault data (v_rid, v_pasid, addr);

-   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
    then sends a page response with virtual completion data (v_rid, v_pasid, 
    response_code) to vIOMMU;

-   Qemu finds the pending fault event, converts virtual completion data 
    into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
    complete the pending fault;

-   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
    ioasid_data->fault_data, and then calls iommu api to complete it with
    {rid, pasid, response_code};
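
A rough userspace sketch of the Qemu-side steps above, in the style of the
flows in this section and reusing the hypothetical fault record and
completion ioctl sketched in section 2.1 (ring_buffer_pop and
guest_response_code are stand-ins for Qemu internals):

	/* wait for the fault notification */
	read(event_fd, &cnt, sizeof(cnt));

	/* pop the next fault record from the shared ring buffer */
	f = ring_buffer_pop(fault_ring);

	/* pick one attached device, inject (v_rid, v_pasid, f->addr)
	  * through the vIOMMU, wait for the guest page response ...
	  */

	/* then complete the pending fault */
	resp = {
		.ioasid		= f->ioasid;
		.addr		= f->addr;
		.response_code	= guest_response_code(f);
	};
	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &resp);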

5.7. BIND_PASID_TABLE
++++++++++++++++++++

The PASID table is put in the GPA space on some platforms, thus must be 
updated by the guest. It is treated as another user page table to be bound 
to the IOMMU.

As explained earlier, the user still needs to explicitly bind every user I/O 
page table to the kernel so the same pgtable binding protocol (bind, cache 
invalidate and fault handling) is unified across platforms.

vIOMMUs may include a caching mode (or a paravirtualized mechanism) which, 
once enabled, requires the guest to invalidate the PASID cache for any change 
to the PASID table. This allows Qemu to track the lifespan of guest I/O page 
tables.

If such a capability is missing, Qemu could enable write-protection on
the guest PASID table to achieve the same effect.

	/* After boot */
	/* Make vPASID space nested on GPA space */
	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to pasidtbl_ioasid */
	at_data = { .ioasid = pasidtbl_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind PASID table */
	bind_data = {
		.ioasid	= pasidtbl_ioasid;
		.addr	= gpa_pasid_table;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);

	/* vIOMMU detects a new GVA I/O space created */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
				gpa_ioasid);

	/* Attach dev1 to the new address space, with gpasid1 */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag 		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
	  * used, the kernel will not update the PASID table. Instead, it just
	  * tracks the bound I/O page table for handling invalidation and
	  * I/O page faults.
	  */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable1;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	...

Thanks
Kevin


* Re: [RFC] /dev/ioasid uAPI proposal
From: Jason Wang @ 2021-05-28  2:24 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com)
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy


On 2021/5/27 3:58 PM, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.


I'm not a native speaker, but /dev/ioas seems better?


>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
>
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
>      2.1. /dev/ioasid uAPI
>      2.2. /dev/vfio uAPI
>      2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
>      5.1. A simple example
>      5.2. Multiple IOASIDs (no nesting)
>      5.3. IOASID nesting (software)
>      5.4. IOASID nesting (hardware)
>      5.5. Guest SVA (vSVA)
>      5.6. I/O page fault
>      5.7. BIND_PASID_TABLE
> ====
>
> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOASID FD is the container holding multiple I/O address spaces. User
> manages those address spaces through FD operations. Multiple FD's are
> allowed per process, but with this proposal one FD should be sufficient for
> all intended usages.
>
> IOASID is the FD-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).
>
> I/O address space can be managed through two protocols, according to
> whether the corresponding I/O page table is constructed by the kernel or
> the user. When kernel-managed, a dma mapping protocol (similar to
> existing VFIO iommu type1) is provided for the user to explicitly specify
> how the I/O address space is mapped. Otherwise, a different protocol is
> provided for the user to bind an user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> handling.
>
> Pgtable binding protocol can be used only on the child IOASID's, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust
> userspace. Nesting allows the kernel to enforce its DMA isolation policy
> through the parent IOASID.
>
> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver


Need to explain what "ioasid driver" means.

I guess it's the module that implements the IOASID abstraction:

1) RID
2) RID+PASID
3) others

And if yes, does it allow a software-specific implementation for the device:

1) swiotlb or
2) device specific IOASID implementation


> is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child/parent page tables operated through
> the dma mapping protocol, so any change in either level can be captured
> by the kernel to update the corresponding shadow mapping.
>
> An I/O address space takes effect in the IOMMU only after it is attached
> to a device. The device in the /dev/ioasid context always refers to a
> physical one or 'pdev' (PF or VF).
>
> One I/O address space could be attached to multiple devices. In this case,
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
>
> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
>
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying
> the routing information and registering it to the ioasid driver when calling
> ioasid attach helper function. It could be RID if the assigned device is
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> user might also provide its view of virtual routing information (vPASID) in
> the attach call, e.g. when multiple user-managed I/O address spaces are
> attached to the vfio_device. In this case VFIO must figure out whether
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
>
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device
> should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have
> different view we can further discuss.
>
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before sent to the
> device. A detail explanation about PASID virtualization policies can be
> found in section 4.
>
> Modern devices may support a scalable workload submission interface
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having
> PASID saved in the CPU MSR and carried in the instruction payload
> when sent out to the device. Then a single work queue shared by
> multiple processes can compose DMAs carrying different PASIDs.
>
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability
> for auto-conversion in the fast path. The user is expected to setup the
> PASID mapping through KVM uAPI, with information about {vpasid,
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> to figure out the actual pPASID given an IOASID.
>
> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). If there is a
> need of further relaying this fault into the guest, the user is responsible
> of identifying the device attached to this IOASID (randomly pick one if
> multiple attached devices) and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
>
> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure.


I'm not sure this is true for all archs.


>   Some platforms implement the PASID table in the guest
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.
>
> We propose the host kernel needs to explicitly track  guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion
> was treating this special IOASID as a container for all guest I/O page
> tables i.e. hiding them from the host. However this way significantly
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID
> one address space any more. Device routing information (indirectly
> marking hidden I/O spaces) has to be carried in iotlb invalidation and
> page faulting uAPI to help connect vIOMMU with the underlying
> pIOMMU. This is one design choice to be confirmed with ARM guys.
>
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device.
>
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no
> device notation in this interface as aforementioned. But the ioasid driver
> does implicit check to make sure that devices within an iommu group
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to
> the user.
>
> There was a long debate in previous discussion whether VFIO should keep
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
> a simplified model where every device bound to VFIO is explicitly listed
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for
> understanding the group topology and meeting the implicit group check
> criteria enforced in /dev/ioasid. The use case examples in this proposal
> are based on the new model.
>
> Of course for backward compatibility VFIO still needs to keep the existing
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO
> iommu ops to internal ioasid helper functions.
>
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>      iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>      find a better name later to differentiate.
>
> -   PPC has not be considered yet as we haven't got time to fully understand
>      its semantics. According to previous discussion there is some generality
>      between PPC window-based scheme and VFIO type1 semantics. Let's
>      first make consensus on this proposal and then further discuss how to
>      extend it to cover PPC's requirement.
>
> -   There is a protocol between vfio group and kvm. Needs to think about
>      how it will be affected following this proposal.
>
> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
>      which can be physically isolated in-between through PASID-granular
>      IOMMU protection. Historically people also discussed one usage by
>      mediating a pdev into a mdev. This usage is not covered here, and is
>      supposed to be replaced by Max's work which allows overriding various
>      VFIO operations in vfio-pci driver.
>
> 2. uAPI Proposal
> ----------------------
>
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
>
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
>
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
>
>
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
>    * Check whether an uAPI extension is supported.
>    *
>    * This is for FD-level capabilities, such as locked page pre-registration.
>    * IOASID-level capabilities are reported through IOASID_GET_INFO.
>    *
>    * Return: 0 if not supported, 1 if supported.
>    */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> /*
>    * Register user space memory where DMA is allowed.
>    *
>    * It pins user pages and does the locked memory accounting so sub-
>    * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>    *
>    * When this ioctl is not used, one user page might be accounted
>    * multiple times when it is mapped by multiple IOASIDs which are
>    * not nested together.
>    *
>    * Input parameters:
>    *	- vaddr;
>    *	- size;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)
>
>
> /*
>    * Allocate an IOASID.
>    *
>    * IOASID is the FD-local software handle representing an I/O address
>    * space. Each IOASID is associated with a single I/O page table. User
>    * must call this ioctl to get an IOASID for every I/O address space that is
>    * intended to be enabled in the IOMMU.
>    *
>    * A newly-created IOASID doesn't accept any command before it is
>    * attached to a device. Once attached, an empty I/O page table is
>    * bound with the IOMMU then the user could use either DMA mapping
>    * or pgtable binding commands to manage this I/O page table.
>    *
>    * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>    *
>    * Return: allocated ioasid on success, -errno on failure.
>    */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)


I would like to know the reason for such indirection.

It looks to me like the ioasid fd is sufficient for performing any operation.

Such an allocation only works if an ioasid fd can have multiple ioasids,
which does not seem to be the case you describe here.


>
>
> /*
>    * Get information about an I/O address space
>    *
>    * Supported capabilities:
>    *	- VFIO type1 map/unmap;
>    *	- pgtable/pasid_table binding
>    *	- hardware nesting vs. software nesting;
>    *	- ...
>    *
>    * Related attributes:
>    * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>    *	- vendor pgtable formats (pgtable binding);
>    *	- number of child IOASIDs (nesting);
>    *	- ...
>    *
>    * Above information is available only after one or more devices are
>    * attached to the specified IOASID. Otherwise the IOASID is just a
>    * number w/o any capability or attribute.
>    *
>    * Input parameters:
>    *	- u32 ioasid;
>    *
>    * Output parameters:
>    *	- many. TBD.
>    */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
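>
> As a strawman only (the exact output format is explicitly TBD above),
> the info could be a capability-flags-plus-attributes structure, e.g.:
>
> 	struct ioasid_info {		/* hypothetical layout */
> 		__u32	argsz;
> 		__u32	flags;		/* capability bits: map/unmap,
> 					 * pgtable/pasid_table binding,
> 					 * hw vs. sw nesting, ... */
> 		__u32	ioasid;
> 		__u64	iova_pgsizes;	/* bitmap of supported page sizes */
> 		__u32	max_child_ioasids;
> 		__u32	pgtable_format;	/* vendor-specific */
> 	};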
>
>
> /*
>    * Map/unmap process virtual addresses to I/O virtual addresses.
>    *
>    * Provide VFIO type1 equivalent semantics. Start with the same
>    * restriction e.g. the unmap size should match those used in the
>    * original mapping call.
>    *
>    * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>    * must be already in the preregistered list.
>    *
>    * Input parameters:
>    *	- u32 ioasid;
>    *	- refer to vfio_iommu_type1_dma_{un}map
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
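>
> For illustration, one possible parameter layout mirroring struct
> vfio_iommu_type1_dma_map (the name and fields below are hypothetical,
> not part of this proposal):
>
> 	struct ioasid_dma_map {
> 		__u32	argsz;
> 		__u32	flags;		/* read/write permission bits */
> 		__u32	ioasid;
> 		__u64	vaddr;		/* process virtual address */
> 		__u64	iova;		/* I/O virtual address */
> 		__u64	size;		/* length of mapping in bytes */
> 	};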
>
>
> /*
>    * Create a nesting IOASID (child) on an existing IOASID (parent)
>    *
>    * IOASIDs can be nested together, implying that the output address
>    * from one I/O page table (child) must be further translated by
>    * another I/O page table (parent).
>    *
>    * As the child adds essentially another reference to the I/O page table
>    * represented by the parent, any device attached to the child ioasid
>    * must be already attached to the parent.
>    *
>    * In concept there is no limit on the number of the nesting levels.
>    * However for the majority case one nesting level is sufficient. The
>    * user should check whether an IOASID supports nesting through
>    * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>    * the nesting capability is reported only on the parent instead of the
>    * child.
>    *
>    * User also needs to check (via IOASID_GET_INFO) whether the nesting
>    * is implemented in hardware or software. If software-based, DMA
>    * mapping protocol should be used on the child IOASID. Otherwise,
>    * the child should be operated with pgtable binding protocol.
>    *
>    * Input parameters:
>    *	- u32 parent_ioasid;
>    *
>    * Return: child_ioasid on success, -errno on failure;
>    */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
>
>
> /*
>    * Bind a user-managed I/O page table to the IOMMU
>    *
>    * Because the user page table is untrusted, IOASID nesting must be enabled
>    * for this ioasid so the kernel can enforce its DMA isolation policy
>    * through the parent ioasid.
>    *
>    * Pgtable binding protocol is different from DMA mapping. The latter
>    * has the I/O page table constructed by the kernel and updated
>    * according to user MAP/UNMAP commands. With pgtable binding the
>    * whole page table is created and updated by userspace, thus different
>    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>    *
>    * Because the page table is directly walked by the IOMMU, the user
>    * must use a format compatible with the underlying hardware. It can
>    * check the format information through IOASID_GET_INFO.
>    *
>    * The page table is bound to the IOMMU according to the routing
>    * information of each attached device under the specified IOASID. The
>    * routing information (RID and optional PASID) is registered when a
>    * device is attached to this IOASID through VFIO uAPI.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- address of the user page table;
>    *	- formats (vendor, address_width, etc.);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
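>
> A possible parameter layout (purely illustrative; the format fields
> would follow whatever IOASID_GET_INFO reports for the attached devices):
>
> 	struct ioasid_bind_pgtable {	/* hypothetical layout */
> 		__u32	argsz;
> 		__u32	flags;
> 		__u32	ioasid;		/* child IOASID */
> 		__u32	format;		/* vendor pgtable format */
> 		__u32	address_width;
> 		__u64	addr;		/* root of the user page table */
> 	};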
>
>
> /*
>    * Bind a user-managed PASID table to the IOMMU
>    *
>    * This is required for platforms which place PASID table in the GPA space.
>    * In this case the specified IOASID represents the per-RID PASID space.
>    *
>    * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>    * special flag to indicate the difference from normal I/O address spaces.
>    *
>    * The format info of the PASID table is reported in IOASID_GET_INFO.
>    *
>    * As explained in the design section, user-managed I/O page tables must
>    * be explicitly bound to the kernel even on these platforms. It allows
>    * the kernel to uniformly manage I/O address spaces across all platforms.
>    * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>    * to carry device routing information to indirectly mark the hidden I/O
>    * address spaces.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- address of PASID table;
>    *	- formats (vendor, size, etc.);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
>
>
> /*
>    * Invalidate IOTLB for a user-managed I/O page table
>    *
>    * Unlike what's defined in include/uapi/linux/iommu.h, this command
>    * doesn't allow the user to specify cache type and likely support only
>    * two granularities (all, or a specified range) in the I/O address space.
>    *
>    * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
>    * cache). If the IOASID represents an I/O address space, the invalidation
>    * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>    * represents a vPASID space, then this command applies to the PASID
>    * cache.
>    *
>    * Similarly this command doesn't provide IOMMU-like granularity
>    * info (domain-wide, pasid-wide, range-based), since it's all about the
>    * I/O address space itself. The ioasid driver walks the attached
>    * routing information to match the IOMMU semantics under the
>    * hood.
>    *
>    * Input parameters:
>    *	- child_ioasid;
>    *	- granularity
>    *
>    * Return: 0 on success, -errno on failure
>    */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
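>
> A possible parameter layout covering the two granularities (hypothetical,
> for illustration only):
>
> 	struct ioasid_cache_invalidate {
> 		__u32	argsz;
> 		__u32	flags;		/* ALL vs. RANGE granularity */
> 		__u32	ioasid;
> 		__u64	addr;		/* used only for RANGE */
> 		__u64	size;		/* used only for RANGE */
> 	};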
>
>
> /*
>    * Page fault report and response
>    *
>    * This is TBD. Can be added after other parts are cleared up. Likely it
>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>    * the user and an ioctl to complete the fault.
>    *
>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>    */
>
>
> /*
>    * Dirty page tracking
>    *
>    * Track and report memory pages dirtied in I/O address spaces. There
>    * is ongoing work by Kunkun Jiang extending the existing VFIO type1
>    * driver. It needs to be adapted to /dev/ioasid later.
>    */
>
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
>
> /*
>    * Bind a vfio_device to the specified IOASID fd
>    *
>    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
>    * vfio device should not be bound to multiple ioasid_fd's.
>    *
>    * Input parameters:
>    *	- ioasid_fd;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
>
>
> /*
>    * Attach a vfio device to the specified IOASID
>    *
>    * Multiple vfio devices can be attached to the same IOASID, and vice
>    * versa.
>    *
>    * User may optionally provide a "virtual PASID" to mark an I/O page
>    * table on this vfio device. Whether the virtual PASID is physically used
>    * or converted to another kernel-allocated PASID is a policy decision in
>    * the vfio device driver.
>    *
>    * There is no need to specify ioasid_fd in this call due to the assumption
>    * of 1:1 connection between vfio device and the bound fd.
>    *
>    * Input parameter:
>    *	- ioasid;
>    *	- flag;
>    *	- user_pasid (if specified);
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
>
>
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
>    * Update CPU PASID mapping
>    *
>    * This is necessary when ENQCMD will be used in the guest while the
>    * targeted device doesn't accept the vPASID saved in the CPU MSR.
>    *
>    * This command allows the user to set/clear the vPASID->pPASID mapping
>    * in the CPU, by providing the IOASID (and FD) information representing
>    * the I/O address space marked by this vPASID.
>    *
>    * Input parameters:
>    *	- user_pasid;
>    *	- ioasid_fd;
>    *	- ioasid;
>    */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
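>
> A possible parameter layout matching the inputs listed above
> (hypothetical, for illustration only):
>
> 	struct kvm_pasid_mapping {
> 		__u32	flags;
> 		__s32	ioasid_fd;
> 		__u32	ioasid;
> 		__u32	user_pasid;	/* vPASID used by the guest */
> 	};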
>
>
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;
> 		// a list of registered devices
> 		struct list_head		dev_list;
> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;
> 	};
>
> Each registered device is represented by ioasid_dev:
>
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device
> 		struct device 		*device;
> 		struct kref		kref;
> 	};
>
> Because we assume one vfio_device is connected to at most one ioasid_fd,
> ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should point to the parent device. The PASID marking this
> mdev is specified later via VFIO_ATTACH_IOASID.
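>
> For illustration, a rough sketch of how the VFIO_BIND_IOASID_FD path
> could use these helpers (the vfio_device layout and function name below
> are assumptions, not part of this proposal):
>
> 	struct vfio_device {
> 		...
> 		struct ioasid_dev	ioasid_dev;	/* at most one bound fd */
> 	};
>
> 	static int vfio_device_bind_ioasid_fd(struct vfio_device *vdev, int fd)
> 	{
> 		struct ioasid_ctx *ctx = ioasid_ctx_fdget(fd);
>
> 		if (!ctx)
> 			return -EBADF;
>
> 		/* for mdev this would point to the parent device instead */
> 		vdev->ioasid_dev.device = vdev->dev;
> 		return ioasid_register_device(ctx, &vdev->ioasid_dev);
> 	}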
>
> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
>
> 		// the IOASID number
> 		u32			ioasid;
>
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;
>
> 		// map metadata (vfio type1 semantics)
> 		struct rb_node		dma_list;
>
> 		// pointer to user-managed pgtable (for nesting case)
> 		u64			user_pgd;
>
> 		// link to the parent ioasid (for nesting)
> 		struct ioasid_data	*parent;
>
> 		// cache the global PASID shared by ENQCMD-capable
> 		// devices (see below explanation in section 4)
> 		u32			pasid;
>
> 		// a list of device attach data (routing information)
> 		struct list_head		attach_data;
>
> 		// a list of partially-attached devices (group)
> 		struct list_head		partial_devices;
>
> 		// a list of fault_data reported from the iommu layer
> 		struct list_head		fault_data;
>
> 		...
> 	}
>
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be correlated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this open
> for now as this proposal is mainly about uAPI. For simplicity
> the two objects are kept separate in this context, assuming a
> 1:1 connection between them, with the domain as the placeholder
> representing the first-class object in the iommu ops.
>
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;
> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev,
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> The pasid parameter is optionally provided based on the policy in the vfio
> device driver. It could be the PASID marking the default I/O address
> space for a mdev, or the user-provided PASID marking a user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check the next section for a detailed explanation.
>
> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}
>
> As explained in the design section, there is no explicit group enforcement
> in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
> implicit group check: until every device within an iommu group has been
> attached to this IOASID, the already-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command while
> the partial_devices list is not empty.
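>
> In code form the enforcement could be as simple as (sketch):
>
> 	/* e.g. at the head of the IOASID_MAP_DMA / BIND_PGTABLE paths */
> 	if (!list_empty(&ioasid_data->partial_devices))
> 		return -EBUSY;	/* iommu group not fully attached yet */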
>
> The last helper function is:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> 		u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share the same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in the next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). The vfio device driver calls this function
> (alloc=true) to get the global PASID for an ioasid before calling
> ioasid_device_attach. KVM also calls this function (alloc=false) to set up
> the PASID translation structure when the user calls KVM_MAP_PASID.
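>
> For example, the attach path of an ENQCMD-capable mdev driver might
> look like this (sketch only; variable names are illustrative and error
> handling is omitted):
>
> 	/* in the VFIO_ATTACH_IOASID handler of the mdev driver */
> 	u32 ppasid = ioasid_get_global_pasid(ctx, at_data.ioasid, true);
>
> 	info = (struct attach_info) {
> 		.ioasid	= at_data.ioasid,
> 		.pasid	= ppasid,	/* pPASID backing the user's vPASID */
> 	};
> 	ioasid_device_attach(&mdev->ioasid_dev, info);
> 	/* also cache vPASID->pPASID for the mediation path (section 4) */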
>
> 4. PASID Virtualization
> ------------------------------
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark a GVA address space while pPASID is the one
> selected by the host and actually routed on the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> The vfio device driver translates vPASID to pPASID before calling
> ioasid_device_attach, with two factors to be considered:
>
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or
>       should be instead converted to a newly-allocated one (vPASID!=
>       pPASID);
>
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>       space or a global PASID space (implying sharing pPASID cross devices,
>       e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>       as part of the process context);
>
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)
>
> 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
>
>       vPASIDs are directly programmed by the guest to the assigned MMIO
>       bar, implying all DMAs out of this device having vPASID in the packet
>       header. This mandates vPASID==pPASID, sort of delegating the entire
>       per-RID PASID space to the guest.
>
>       When ENQCMD is enabled, the CPU MSR when running a guest task
>       contains a vPASID. In this case the CPU PASID translation capability
>       should be disabled so this vPASID in CPU MSR is directly sent to the
>       wire.
>
>       This ensures consistent vPASID usage on pdev regardless of the
>       workload submitted through a MMIO register or ENQCMD instruction.
>
> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
>       PASIDs are also used by the kernel to mark the default I/O address space
>       for mdev, thus cannot be delegated to the guest. Instead, the mdev
>       driver must allocate a new pPASID for each vPASID (thus vPASID!=
>       pPASID) and then use pPASID when attaching this mdev to an ioasid.
>
>       The mdev driver needs to cache the PASID mapping so that in the
>       mediation path the vPASID programmed by the guest can be converted
>       to pPASID before updating the physical MMIO register. The mapping
>       should also be saved in the CPU PASID translation structure (via KVM
>       uAPI), so the vPASID saved in the CPU MSR is auto-translated to
>       pPASID before being sent to the wire, when ENQCMD is enabled.
>
>       Generally pPASID could be allocated from the per-RID PASID space
>       if all mdev's created on the parent device don't support ENQCMD.
>
>       However if the parent supports ENQCMD-capable mdev, pPASIDs
>       must be allocated from a global pool because the CPU PASID
>       translation structure is per-VM. It implies that when a guest I/O
>       page table is attached to two mdevs with a single vPASID (i.e. bound
>       to the same guest process), the same pPASID should be used for
>       both mdevs even when they belong to different parents. Sharing
>       a pPASID across mdevs is achieved by calling the aforementioned
>       ioasid_get_global_pasid().
>
> 3)  Mix pdev/mdev together
>
>       The above policies are per device type and thus are not affected when mixing
>       those device types together (when assigned to a single guest). However,
>       there is one exception - when both pdev/mdev support ENQCMD.
>
>       Remember the two types have conflicting requirements on whether
>       CPU PASID translation should be enabled. This capability is per-VM,
>       and must be enabled for mdev isolation. When enabled, pdev will
>       receive a mdev pPASID violating its vPASID expectation.
>
>       In a previous thread a PASID range split scheme was discussed to support
>       this combination, but we haven't worked out a clean uAPI design yet.
>       Therefore in this proposal we decide not to support it, implying the
>       user should be smart enough to avoid such a scenario. It could be
>       a TODO task for the future.
>
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;
>
> Regardless of the kernel policy, the user policy is unchanged:
>
> -    provide vPASID when calling VFIO_ATTACH_IOASID;
> -    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> -    Don't expose ENQCMD capability on both pdev and mdev;
>
> Sample user flow is described in section 5.5.
>
> 5. Use Cases and Flows
> -------------------------------
>
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
>
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> 	ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
>
> Three types of IOASIDs are considered:
>
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). The VFIO device driver in the kernel will figure out
> the associated routing information during the attach operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned more than dev1, the user follows the above sequence
> to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA
> address space across all assigned devices.
>
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
>
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 	/* After boot, guest enables a GIOVA space for dev2 */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel, instead of the user, that creates
> the shadow mapping.
>
> The flow before the guest boots is the same as 5.2, except for one point.
> Because giova_ioasid is nested on gpa_ioasid, locked page accounting is
> only conducted for gpa_ioasid, so it's not necessary to pre-register
> virtual memory.
>
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
>
> 	/* After the guest boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


For vDPA, we need something similar. And in the future, vDPA may allow 
multiple ioasids to be attached to a single device. It should work with
the current design.


>
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> 	/* After the guest boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't support
hardware nesting. Or is there a way to detect the capability beforehand?

I think GET_INFO only works after the ATTACH.


>
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	/* Invalidate IOTLB when required */
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		// granular information
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>
> 	/* See 5.6 for I/O page fault handling */
> 	
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boot the guest further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
>
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:


My understanding is ENQCMD is Intel specific and not a requirement for 
having vSVA.


>
> 	/* After the guest boots */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	...
>
>
> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. Here is just the high-level flow from the host IOMMU driver
> to the guest IOMMU driver and back; a rough userspace-side sketch follows
> the list of steps).
>
> -   Host IOMMU driver receives a page request with raw fault_data {rid,
>      pasid, addr};
>
> -   Host IOMMU driver identifies the faulting I/O page table according to
>      information registered by IOASID fault handler;
>
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
>      is saved in ioasid_data->fault_data (used for response);
>
> -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
>      to the shared ring buffer and triggers eventfd to userspace;
>
> -   Upon receiving the event, Qemu needs to find the virtual routing information
>      (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
>      multiple, pick a random one. This should be fine since the purpose is to
>      fix the I/O page table on the guest;
>
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>      carrying the virtual fault data (v_rid, v_pasid, addr);
>
> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>      then sends a page response with virtual completion data (v_rid, v_pasid,
>      response_code) to vIOMMU;
>
> -   Qemu finds the pending fault event, converts virtual completion data
>      into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
>      complete the pending fault;
>
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
>      ioasid_data->fault_data, and then calls iommu api to complete it with
>      {rid, pasid, response_code};
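>
> A rough userspace-side sketch of the above flow (every structure and
> ioctl name below is a placeholder, since the fault uAPI itself is TBD):
>
> 	/* eventfd signalled: consume one entry from the shared ring */
> 	fault = ring_pop(fault_ring);		// { ioasid, addr, ... }
>
> 	/* pick one device attached to the faulting ioasid */
> 	find_attached_vdev(fault.ioasid, &v_rid, &v_pasid);
>
> 	/* inject a virtual page request into the guest via vIOMMU */
> 	viommu_inject_page_req(v_rid, v_pasid, fault.addr);
>
> 	/* later, when the guest responds through the vIOMMU */
> 	resp = {
> 		.ioasid		= fault.ioasid;
> 		.response_code	= code;
> 	};
> 	ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);	// placeholder name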
>
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> The PASID table is put in the GPA space on some platforms, thus it must be
> updated by the guest. It is treated as another user page table to be bound with the
> IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once
> enabled, requires the guest to invalidate PASID cache for any change on the
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> If such a capability is missing, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> 	/* After the guest boots */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
>
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);


Do we need VFIO_DETACH_IOASID?

Thanks


>
> 	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> 	...
>
> Thanks
> Kevin
>



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
  2021-05-28  2:24 ` Jason Wang
@ 2021-05-28 16:23 ` Jean-Philippe Brucker
  2021-05-28 20:16   ` Jason Gunthorpe
  2021-06-01  7:50   ` Tian, Kevin
  2021-05-28 17:35 ` Jason Gunthorpe
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 258+ messages in thread
From: Jean-Philippe Brucker @ 2021-05-28 16:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Firstly thanks for writing this up and for your patience. I've not read in
detail the second half yet, will take another look later.

> 1. Terminologies and Concepts
> -----------------------------------------
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).
> 
> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind an user-managed I/O page table to the 
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASID's, implying 
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.
> 
> IOASID nesting can be implemented in two ways: hardware nesting and 
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

Is there an advantage to moving software nesting into the kernel?
We could just have the guest do its usual combined map/unmap on the child
fd.

> 
> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/ioasid context always refers to a 
> physical one or 'pdev' (PF or VF). 
> 
> One I/O address space could be attached to multiple devices. In this case, 
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> 
> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
> 
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying 
> the routing information and registering it to the ioasid driver when calling 
> ioasid attach helper function. It could be RID if the assigned device is 
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
> user might also provide its view of virtual routing information (vPASID) in 
> the attach call, e.g. when multiple user-managed I/O address spaces are 
> attached to the vfio_device. In this case VFIO must figure out whether 
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
> 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.
> 
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 
> 
> Modern devices may support a scalable workload submission interface 
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having 
> PASID saved in the CPU MSR and carried in the instruction payload 
> when sent out to the device. Then a single work queue shared by 
> multiple processes can compose DMAs carrying different PASIDs. 
> 
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability 
> for auto-conversion in the fast path. The user is expected to setup the 
> PASID mapping through KVM uAPI, with information about {vpasid, 
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
> to figure out the actual pPASID given an IOASID.
> 
> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For 
> example, I/O page fault is always reported to userspace per IOASID, 
> although it's physically reported per device (RID+PASID). If there is a 
> need of further relaying this fault into the guest, the user is responsible 
> of identifying the device attached to this IOASID (randomly pick one if 
> multiple attached devices)

We need to report accurate information for faults. If the guest tells
device A to DMA, it shouldn't receive a fault report for device B. This is
important if the guest needs to kill a misbehaving device, or even just
for statistics and debugging. It may also simplify routing the page
response, which has to be fast.

> and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space. 
> 
> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who 
> actually writes the PASID table).

This adds significant complexity for Arm (and AMD). Userspace will now
need to walk the PASID table, serializing against invalidation. At least
the SMMU has caching mode for PASID tables so there is no need to trap,
but I'd rather avoid this. I really don't want to make virtio-iommu
devices walk PASID tables unless absolutely necessary, they need to stay
simple.

> One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. However this way significantly 
> violates the philosophy in this /dev/ioasid proposal.

It does correspond better to the underlying architecture and hardware
implementation, of which userspace is well aware since it has to report
them to the guest and deal with different descriptor formats.

> It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU.

As above, I think it's essential that we carry device information in fault
reports. In addition to the two previous reasons, on these platforms
userspace will route all faults through the same channel (vIOMMU event
queue) regardless of the PASID, so we do not need them split and tracked
by PASID. Given that IOPF will be a hot path we should make sure there is
no unnecessary indirection.

Regarding the invalidation, I think limiting it to IOASID may work but it
does bother me that we can't directly forward all invalidations received
on the vIOMMU: if the guest sends a device-wide invalidation, do we
iterate over all IOASIDs and issue one ioctl for each?  Sure the guest is
probably sending that because of detaching the PASID table, for which the
kernel did perform the invalidation, but we can't just assume that and
ignore the request, there may be a different reason. Iterating is going to
take a lot of time, whereas with the current API we can send a single request
and issue a single command to the IOMMU hardware.
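
Concretely, that would mean something like the following for a single
guest command (sketch, using the invalidation ioctl proposed above):

	/* guest issued one device-wide (all-PASID) invalidation for dev2 */
	for_each_ioasid_attached_to(dev2, ioasid) {	// tracked by userspace
		inv_data = { .ioasid = ioasid };	// full-range granularity
		ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
	}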

Similarly, if the guest sends an ATC invalidation for a whole device (in
the SMMU, that's an ATC_INV without SSID), we'll have to transform that
into multiple IOTLB invalidations?  We can't just send it on IOASID #0,
because it may not have been created by the guest.

Maybe we could at least have invalidation requests on the parent fd for
this kind of global case?  But I'd much rather avoid the PASID tracking
altogether and keep the existing cache invalidate API, let the pIOMMU
driver decode that stuff.

> This is one design choice to be confirmed with ARM guys.
> 
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device. 
> 
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
> device notation in this interface as aforementioned. But the ioasid driver 
> does implicit check to make sure that devices within an iommu group 
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to 
> the user.
> 
> There was a long debate in previous discussion whether VFIO should keep 
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
> a simplified model where every device bound to VFIO is explicitly listed 
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for 
> understanding the group topology and meeting the implicit group check 
> criteria enforced in /dev/ioasid. The use case examples in this proposal 
> are based on the new model.
> 
> Of course for backward compatibility VFIO still needs to keep the existing 
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
> iommu ops to internal ioasid helper functions.
> 
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.

Yes this isn't just about allocating PASIDs anymore. /dev/iommu or
/dev/ioas would make more sense.

> 
> -   PPC has not be considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.
> 
> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

(Arm also needs this, obtaining the VMID allocated by KVM and writing it to
the SMMU descriptor when installing the PASID table
https://lore.kernel.org/linux-iommu/20210222155338.26132-1-shameerali.kolothum.thodi@huawei.com/)

> 
> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
>     which can be physically isolated in-between through PASID-granular
>     IOMMU protection. Historically people also discussed one usage by 
>     mediating a pdev into a mdev. This usage is not covered here, and is 
>     supposed to be replaced by Max's work which allows overriding various 
>     VFIO operations in vfio-pci driver.
> 
> 2. uAPI Proposal
> ----------------------
[...]

> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *
>   * Output parameters:
>   *	- many. TBD.

We probably need a capability format similar to PCI and VFIO.

>   */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
[...]

> 2.2. /dev/vfio uAPI
> ++++++++++++++++
> 
> /*
>   * Bind a vfio_device to the specified IOASID fd
>   *
>   * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
>   * vfio device should not be bound to multiple ioasid_fd's. 
>   *
>   * Input parameters:
>   *	- ioasid_fd;

How about adding a 32-bit "virtual RID" at this point, that the kernel can
provide to userspace during fault reporting?

Thanks,
Jean

>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
  2021-05-28  2:24 ` Jason Wang
  2021-05-28 16:23 ` Jean-Philippe Brucker
@ 2021-05-28 17:35 ` Jason Gunthorpe
  2021-06-01  8:10   ` Tian, Kevin
  2021-06-02  6:32   ` David Gibson
  2021-05-28 19:58 ` Jason Gunthorpe
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 17:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> IOASID nesting can be implemented in two ways: hardware nesting and 
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

Why? A SW emulation could do this synchronization during invalidation
processing if invalidation contained an IOVA range.

I think this document would be stronger if it included some "Rationale"
statements in key places.

> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I wonder if we should just adopt the ARM naming as the API
standard. It is general and doesn't have the SVA connotation that
"Process Address Space ID" carries.
 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.

Unless there is some internal kernel design reason to block it, I
wouldn't go out of my way to prevent it.

> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 

vPASID and related seems like it needs other IOMMU vendors to take a
very careful look. I'm really glad to see this starting to be spelled
out in such a clear way, as it was hard to see from the patches there
is vendor variation.

> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). 

I agree with Jean-Philippe - at the very least erasing this
information needs a major rationale - but I don't really see why it
must be erased? The HW reports the originating device, is it just a
matter of labeling the devices attached to the /dev/ioasid FD so it
can be reported to userspace?

> multiple attached devices) and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

This seems OK though, I can't think of a reason to allow an IOASID to
be left partially invalidated???
 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space. 
> 
> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. 

> However this way significantly 
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU. This is one design choice to be confirmed with ARM guys.

I'm confused by this rational.

For a vIOMMU that has IO page tables in the guest the basic
choices are:
 - Do we have a hypervisor trap to bind the page table or not? (RID
   and PASID may differ here)
 - Do we have a hypervisor trap to invalidate the page tables or not?

If the first is a hypervisor trap then I agree it makes sense to create a
child IOASID that points to each guest page table and manage it
directly. This should not require walking guest page tables as it is
really just informing the HW where the page table lives. HW will walk
them.

If there are no hypervisor traps (does this exist?) then there is no
way to involve the hypervisor here and the child IOASID should simply
be a pointer to the guest's data structure that describes binding. In
this case that IOASID should claim all PASIDs when bound to a
RID. 

Invalidation should be passed up the to the IOMMU driver in terms of
the guest tables information and either the HW or software has to walk
to guest tables to make sense of it.

Events from the IOMMU to userspace should be tagged with the attached
device label and the PASID/substream ID. This means there is no issue
to have an 'all PASID' IOASID.

> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.

+1 on Jean-Philippe's remarks

> -   PPC has not be considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.

From what I understood PPC is not so bad: nesting IOASIDs covers its
preload feature, and it needed a way to specify/query the IOVA range an
IOASID will cover.

> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

Ugh, I always stop looking when I reach that boundary. Can anyone
summarize what is going on there?

Most likely passing the /dev/ioasid into KVM's FD (or vice versa) is the
right answer. Eg if ARM needs to get the VMID from KVM and set it to
ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
reasonable. Certainly better than the symbol_get stuff we have right
now.

I will read through the detail below in another email

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (2 preceding siblings ...)
  2021-05-28 17:35 ` Jason Gunthorpe
@ 2021-05-28 19:58 ` Jason Gunthorpe
  2021-06-01  8:38   ` Tian, Kevin
  2021-06-02  6:48   ` David Gibson
  2021-05-28 20:03 ` Jason Gunthorpe
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 19:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> 5. Use Cases and Flows
> 
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
> 
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> 
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
> 
> 	ioasid_fd = open("/dev/ioasid", mode);
> 
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.

For others, I don't think this is *strictly* necessary, we can
probably still get to the device_fd using the group_fd and fit in
/dev/ioasid. It does make the rest of this more readable though.


> Three types of IOASIDs are considered:
> 
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
> 
> At least one gpa_ioasid must always be created per guest, while the other 
> two are relevant as far as vIOMMU is concerned.
> 
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the 
> associated routing information in the attaching operation.
> 
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
> 
> 5.1. A simple example
> ++++++++++++++++++
> 
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
> 
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> If the guest is assigned with more than dev1, user follows above sequence
> to attach other devices to the same gpa_ioasid i.e. sharing the GPA 
> address space cross all assigned devices.

eg

 	device2_fd = open("/dev/vfio/devices/dev2", mode);
 	ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
 	ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);

Right?

> 
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
> 
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates 
> an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> through mode (gpa_ioasid).
> 
> Suppose IOASID nesting is not supported in this case. Qemu need to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
> 
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
> 
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> 
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 	/* After boot, guest enables an GIOVA space for dev2 */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> 
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */

Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
IOMMU?

> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA

eg HVA came from reading the guest's page tables and finding it wanted
GPA 0x1000 mapped to IOVA 0x2000?


> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with software-based IOASID nesting 
> available. In this mode it is the kernel instead of user to create the
> shadow mapping.
> 
> The flow before guest boots is same as 5.2, except one point. Because 
> giova_ioasid is nested on gpa_ioasid, locked accounting is only 
> conducted for gpa_ioasid. So it's not necessary to pre-register virtual 
> memory.
> 
> To save space we only list the steps after boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
> 
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

And in this version the kernel reaches into the parent IOASID's page
tables to translate 0x1000 to 0x40001000 and then to the physical
page? So we basically remove the qemu process address space entirely
from this translation. It does seem convenient.
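
Roughly, the kernel-side handling of IOASID_DMA_MAP on a software-nested
child could then look like this (just a sketch; the helper names here are
made up):

	/* sketch: shadow one GIOVA->GPA map request through the parent */
	hva  = gpa_to_hva(gpa_ioasid, dma_map.vaddr);	/* GPA 0x1000 -> HVA 0x40001000 */
	page = pin_user_page(hva);			/* pinning/accounting stays on the parent */
	io_pgtable_map(giova_ioasid, dma_map.iova, page_to_phys(page),
		       dma_map.size);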

> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to 
> bind the guest IOVA page table with the IOMMU:
> 
> 	/* After boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

I really think you need to use consistent language. Things that
allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
alloc/create/bind is too confusing.
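
For example (names only, to illustrate the pattern; nothing here is final):

	IOASID_ALLOC_IOASID			/* kernel-managed I/O page table (map/unmap) */
	IOASID_ALLOC_IOASID_PGTABLE		/* user-managed I/O page table (bind) */
	IOASID_ALLOC_IOASID_PASID_TABLE		/* user-managed PASID table */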

> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
> 
> After boot the guest further creates a GVA address space (gpasid1) on 
> dev1. Dev2 is not affected (still attached to giova_ioasid).
> 
> As explained in section 4, user should avoid exposing ENQCMD on both
> pdev and mdev.
> 
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:
> 
> 	/* After boots */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

Still a little unsure why the vPASID is here, not on the gva_ioasid. Is
there any scenario where we want different vpasid's for the same
IOASID? I guess it is OK like this. Hum.

> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

Make sense

> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Again I do wonder if this should just be part of alloc_ioasid. Is
there any reason to split these things? The only advantage to the
split is the device is known, but the device shouldn't impact
anything.

> 5.6. I/O page fault
> +++++++++++++++
> 
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid, 
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates an user fault_data (ioasid, addr), links it 
>     to the shared ring buffer and triggers eventfd to userspace;

Here the rid should be translated to a labeled device, returning the
device label from VFIO_BIND_IOASID_FD. Depending on how the device was
bound, the label might match a rid or a (rid, pasid).

> -   Upon received event, Qemu needs to find the virtual routing information 
>     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
>     multiple, pick a random one. This should be fine since the purpose is to
>     fix the I/O page table on the guest;

The device label should fix this
 
> -   Qemu finds the pending fault event, converts virtual completion data 
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
>     complete the pending fault;
> 
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};

So resuming a fault on an ioasid will resume all devices pending on
the fault?

> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
> 
> PASID table is put in the GPA space on some platform, thus must be updated
> by the guest. It is treated as another user page table to be bound with the 
> IOMMU.
> 
> As explained earlier, the user still needs to explicitly bind every user I/O 
> page table to the kernel so the same pgtable binding protocol (bind, cache 
> invalidate and fault handling) is unified cross platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> enabled, requires the guest to invalidate PASID cache for any change on the 
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
> 
> 	/* After boots */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> 
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

I still don't quite get the benefit from doing this.

The idea to create an all PASID IOASID seems to work better with less
fuss on HW that is directly parsing the guest's PASID table.

Cache invalidate seems easy enough to support

Fault handling needs to return the (ioasid, device_label, pasid) when
working with this kind of ioasid.

It is true that it does create an additional flow qemu has to
implement, but it does directly mirror the HW.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (3 preceding siblings ...)
  2021-05-28 19:58 ` Jason Gunthorpe
@ 2021-05-28 20:03 ` Jason Gunthorpe
  2021-06-01  7:01   ` Tian, Kevin
  2021-05-28 23:36 ` Jason Gunthorpe
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 20:03 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 

It is very long, but I think this has turned out quite well. It
certainly matches the basic sketch I had in my head when we were
talking about how to create vDPA devices a few years ago.

When you get down to the operations they all seem pretty common sense
and straightforward. Create an IOASID. Connect to a device. Fill the
IOASID with pages somehow. Worry about PASID labeling.

It really is critical to get all the vendor IOMMU people to go over it
and see how their HW features map into this.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 16:23 ` Jean-Philippe Brucker
@ 2021-05-28 20:16   ` Jason Gunthorpe
  2021-06-01  7:50   ` Tian, Kevin
  1 sibling, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 20:16 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

On Fri, May 28, 2021 at 06:23:07PM +0200, Jean-Philippe Brucker wrote:

> Regarding the invalidation, I think limiting it to IOASID may work but it
> does bother me that we can't directly forward all invalidations received
> on the vIOMMU: if the guest sends a device-wide invalidation, do we
> iterate over all IOASIDs and issue one ioctl for each?  Sure the guest is
> probably sending that because of detaching the PASID table, for which the
> kernel did perform the invalidation, but we can't just assume that and
> ignore the request, there may be a different reason. Iterating is going to
> take a lot time, whereas with the current API we can send a single request
> and issue a single command to the IOMMU hardware.

I think the invalidation could stand some improvement, but that also
feels basically incremental to the essence of the proposal.

I agree with the general goal that the uAPI should be able to issue
invalidates that directly map to HW invalidations.

> Similarly, if the guest sends an ATC invalidation for a whole device (in
> the SMMU, that's an ATC_INV without SSID), we'll have to transform that
> into multiple IOTLB invalidations?  We can't just send it on IOASID #0,
> because it may not have been created by the guest.

For instance adding device labels allows a device-wide invalidate
operation to exist, and the "generic" kernel driver can iterate over
all IOASIDs hooked to the device. Overridable by the IOMMU driver.
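
In the generic code that could boil down to roughly (sketch; both helpers
are invented for illustration):

	/* sketch: device-wide invalidation falls back to per-IOASID flushes */
	for_each_ioasid_attached(dev_label, ioasid)
		ioasid_flush_iotlb(ioasid, 0, ULLONG_MAX);	/* whole space */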

> > Notes:
> > -   It might be confusing as IOASID is also used in the kernel (drivers/
> >     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need to
> >     find a better name later to differentiate.
> 
> Yes this isn't just about allocating PASIDs anymore. /dev/iommu or
> /dev/ioas would make more sense.

Either makes sense to me

/dev/iommu, with the internal IOASID objects called IOAS (==
iommu_domain), is not bad

> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >   *	- vendor pgtable formats (pgtable binding);
> >   *	- number of child IOASIDs (nesting);
> >   *	- ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.
> >   *
> >   * Input parameters:
> >   *	- u32 ioasid;
> >   *
> >   * Output parameters:
> >   *	- many. TBD.
> 
> We probably need a capability format similar to PCI and VFIO.

Designing this kind of uAPI where it is half HW and half generic is
really tricky to get right. Probably best to take the detailed design
of the IOCTL structs later.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28  2:24 ` Jason Wang
@ 2021-05-28 20:25   ` Jason Gunthorpe
       [not found]   ` <20210531164118.265789ee@yiliu-dev>
  1 sibling, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 20:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On Fri, May 28, 2021 at 10:24:56AM +0800, Jason Wang wrote:
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver
> 
> Need to explain what did "ioasid driver" mean.

I think it means "drivers/iommu"

> And if yes, does it allow the device for software specific implementation:
> 
> 1) swiotlb or

I think it is necessary to have a 'software page table' which is
required to support all the mdevs we have today.

> 2) device specific IOASID implementation

"drivers/iommu" is pluggable, so I guess it can exist? I've never seen
it done before though

Whether we'd want this to drive an on-device translation table is an
interesting question. I don't have an answer

> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure.
> 
> I'm not sure this is true for all archs.

It must be true. For security reasons access to a PASID must be
limited by RID.

RID_A assigned to guest A should not be able to access a PASID being
used by RID_B in guest B. Only a per-RID restriction can accomplish
this.

> I would like to know the reason for such indirection.
> 
> It looks to me the ioasid fd is sufficient for performing any operations.
> 
> Such allocation only work if as ioas fd can have multiple ioasid which seems
> not the case you describe here.

It is the case, read the examples section. One had 3 interrelated
IOASID objects inside the same FD.
 
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> > 
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> > 
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> > 
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> > 
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> > 
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 
> For vDPA, we need something similar. And in the future, vDPA may allow
> multiple ioasid to be attached to a single device. It should work with the
> current design.

What do you imagine multiple IOASIDs being used for in vDPA?

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (4 preceding siblings ...)
  2021-05-28 20:03 ` Jason Gunthorpe
@ 2021-05-28 23:36 ` Jason Gunthorpe
  2021-05-31 11:31   ` Liu Yi L
                     ` (3 more replies)
  2021-05-31 17:37 ` Parav Pandit
                   ` (4 subsequent siblings)
  10 siblings, 4 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-28 23:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> 
> /*
>   * Check whether an uAPI extension is supported. 
>   *
>   * This is for FD-level capabilities, such as locked page pre-registration. 
>   * IOASID-level capabilities are reported through IOASID_GET_INFO.
>   *
>   * Return: 0 if not supported, 1 if supported.
>   */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)

 
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)

So VA ranges are pinned and stored in a tree and later references to
those VA ranges by any other IOASID use the pin cached in the tree?

It seems reasonable and is similar to the ioasid parent/child I
suggested for PPC.

IMHO this should be merged with the all SW IOASID that is required for
today's mdev drivers. If this can be done while keeping this uAPI then
great, otherwise I don't think it is so bad to weakly nest a physical
IOASID under a SW one just to optimize page pinning.

Either way this seems like a smart direction

> /*
>   * Allocate an IOASID. 
>   *
>   * IOASID is the FD-local software handle representing an I/O address 
>   * space. Each IOASID is associated with a single I/O page table. User 
>   * must call this ioctl to get an IOASID for every I/O address space that is
>   * intended to be enabled in the IOMMU.
>   *
>   * A newly-created IOASID doesn't accept any command before it is 
>   * attached to a device. Once attached, an empty I/O page table is 
>   * bound with the IOMMU then the user could use either DMA mapping 
>   * or pgtable binding commands to manage this I/O page table.

Can the IOASID be populated before being attached?

>   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>   *
>   * Return: allocated ioasid on success, -errno on failure.
>   */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)

I assume alloc will include quite a big structure to satisfy the
various vendor needs?

> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.

This feels wrong to learn most of these attributes of the IOASID after
attaching to a device.

The user should have some idea how it intends to use the IOASID when
it creates it and the rest of the system should match the intention.

For instance if the user is creating a IOASID to cover the guest GPA
with the intention of making children it should indicate this during
alloc.

If the user is intending to point a child IOASID to a guest page table
in a certain descriptor format then it should indicate it during
alloc.

device bind should fail if the device somehow isn't compatible with
the scheme the user is trying to use.

> /*
>   * Map/unmap process virtual addresses to I/O virtual addresses.
>   *
>   * Provide VFIO type1 equivalent semantics. Start with the same 
>   * restriction e.g. the unmap size should match those used in the 
>   * original mapping call. 
>   *
>   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>   * must be already in the preregistered list.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *	- refer to vfio_iommu_type1_dma_{un}map
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)

What about nested IOASIDs?

> /*
>   * Create a nesting IOASID (child) on an existing IOASID (parent)
>   *
>   * IOASIDs can be nested together, implying that the output address 
>   * from one I/O page table (child) must be further translated by 
>   * another I/O page table (parent).
>   *
>   * As the child adds essentially another reference to the I/O page table 
>   * represented by the parent, any device attached to the child ioasid 
>   * must be already attached to the parent.
>   *
>   * In concept there is no limit on the number of the nesting levels. 
>   * However for the majority case one nesting level is sufficient. The
>   * user should check whether an IOASID supports nesting through 
>   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>   * the nesting capability is reported only on the parent instead of the
>   * child.
>   *
>   * User also needs check (via IOASID_GET_INFO) whether the nesting 
>   * is implemented in hardware or software. If software-based, DMA 
>   * mapping protocol should be used on the child IOASID. Otherwise, 
>   * the child should be operated with pgtable binding protocol.
>   *
>   * Input parameters:
>   *	- u32 parent_ioasid;
>   *
>   * Return: child_ioasid on success, -errno on failure;
>   */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)

Do you think another ioctl is best? Should this just be another
parameter to alloc?
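
Folding it into alloc could be as simple as (sketch; the flag name is
made up):

	alloc = {
		.flags		= IOASID_ALLOC_NESTED;	/* hypothetical flag */
		.parent_ioasid	= gpa_ioasid;		/* ignored when not nested */
		// possibly the intended pgtable format as well
	};
	child_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);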

> /*
>   * Bind an user-managed I/O page table with the IOMMU
>   *
>   * Because user page table is untrusted, IOASID nesting must be enabled 
>   * for this ioasid so the kernel can enforce its DMA isolation policy 
>   * through the parent ioasid.
>   *
>   * Pgtable binding protocol is different from DMA mapping. The latter 
>   * has the I/O page table constructed by the kernel and updated 
>   * according to user MAP/UNMAP commands. With pgtable binding the 
>   * whole page table is created and updated by userspace, thus different 
>   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>   *
>   * Because the page table is directly walked by the IOMMU, the user 
>   * must  use a format compatible to the underlying hardware. It can 
>   * check the format information through IOASID_GET_INFO.
>   *
>   * The page table is bound to the IOMMU according to the routing 
>   * information of each attached device under the specified IOASID. The
>   * routing information (RID and optional PASID) is registered when a 
>   * device is attached to this IOASID through VFIO uAPI. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of the user page table;
>   *	- formats (vendor, address_width, etc.);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)

Also feels backwards, why wouldn't we specify this, and the required
page table format, during alloc time?

> /*
>   * Bind an user-managed PASID table to the IOMMU
>   *
>   * This is required for platforms which place PASID table in the GPA space.
>   * In this case the specified IOASID represents the per-RID PASID space.
>   *
>   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>   * special flag to indicate the difference from normal I/O address spaces.
>   *
>   * The format info of the PASID table is reported in IOASID_GET_INFO.
>   *
>   * As explained in the design section, user-managed I/O page tables must
>   * be explicitly bound to the kernel even on these platforms. It allows
>   * the kernel to uniformly manage I/O address spaces cross all platforms.
>   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>   * to carry device routing information to indirectly mark the hidden I/O
>   * address spaces.
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of PASID table;
>   *	- formats (vendor, size, etc.);
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)

Ditto

> 
> /*
>   * Invalidate IOTLB for an user-managed I/O page table
>   *
>   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
>   * doesn't allow the user to specify cache type and likely support only
>   * two granularities (all, or a specified range) in the I/O address space.
>   *
>   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
>   * cache). If the IOASID represents an I/O address space, the invalidation
>   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>   * represents a vPASID space, then this command applies to the PASID
>   * cache.
>   *
>   * Similarly this command doesn't provide IOMMU-like granularity
>   * info (domain-wide, pasid-wide, range-based), since it's all about the
>   * I/O address space itself. The ioasid driver walks the attached
>   * routing information to match the IOMMU semantics under the
>   * hood. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- granularity
>   * 
>   * Return: 0 on success, -errno on failure
>   */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)

This should have an IOVA range too?
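
e.g. (sketch, following the pseudo-code style above; the granularity
names are made up):

	inv_data = {
		.ioasid		= child_ioasid;
		.granularity	= IOASID_INV_RANGE;	/* or IOASID_INV_ALL */
		.iova		= 0x2000;		/* used only for RANGE */
		.size		= 4KB;
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);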

> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */

Any reason not to just use read()?
  
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++

To be clear you mean the 'struct vfio_device' API, these are not
IOCTLs on the container or group?

> /*
>    * Bind a vfio_device to the specified IOASID fd
>    *
>    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
>    * vfio device should not be bound to multiple ioasid_fd's.
>    *
>    * Input parameters:
>    *  - ioasid_fd;
>    *
>    * Return: 0 on success, -errno on failure.
>    */
> #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)

This is where it would make sense to have an output "device id" that
allows /dev/ioasid to refer to this "device" by number in events and
other related things.
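
Roughly (sketch; the output field is invented):

	bind_data = {
		.ioasid_fd	= ioasid_fd;	/* in */
		.device_label	= 0;		/* out: referenced later in fault and
						 * invalidation reporting */
	};
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &bind_data);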

> 
> 2.3. KVM uAPI
> ++++++++++++
> 
> /*
>   * Update CPU PASID mapping
>   *
>   * This is necessary when ENQCMD will be used in the guest while the
>   * targeted device doesn't accept the vPASID saved in the CPU MSR.
>   *
>   * This command allows user to set/clear the vPASID->pPASID mapping
>   * in the CPU, by providing the IOASID (and FD) information representing
>   * the I/O address space marked by this vPASID.
>   *
>   * Input parameters:
>   *	- user_pasid;
>   *	- ioasid_fd;
>   *	- ioasid;
>   */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)

It seems simple enough. So the physical PASID can only be assigned if
the user has an IOASID that points at it? Thus it is secure?
 
> 3. Sample structures and helper functions
> 
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> 
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
> 
> An ioasid_ctx is created for each fd:
> 
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;

Would expect an xarray

> 		// a list of registered devices
> 		struct list_head		dev_list;

xarray of device_id

> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;

Should re-use the existing SW IOASID table, and be an interval tree.
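
i.e. something like (sketch):

	struct ioasid_ctx {
		struct xarray		ioasid_xa;	/* ioasid -> ioasid_data */
		struct xarray		device_xa;	/* device id -> ioasid_dev */
		struct rb_root_cached	prereg_itree;	/* pinned VA ranges (interval tree) */
	};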

> Each registered device is represented by ioasid_dev:
> 
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device
> 		struct device 		*device;
> 		struct kref		kref;
> 	};
> 
> Because we assume one vfio_device connected to at most one ioasid_fd, 
> here ioasid_dev could be embedded in vfio_device and then linked to 
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. PASID marking this
> mdev is specified later when VFIO_ATTACH_IOASID.

Don't embed a struct like this in something like vfio_device - that
just makes a mess of reference counting by having multiple krefs in
the same memory block. Keep it as a pointer; the attach operation
should return a pointer to the above struct.

> An ioasid_data is created when IOASID_ALLOC, as the main object 
> describing characteristics about an I/O page table:
> 
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
> 
> 		// the IOASID number
> 		u32			ioasid;
> 
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;

But at least for the first coding draft I would expect to see this API
presented with no PASID support and a simple 1:1 with iommu_domain. How
PASID gets modeled is the big TBD, right?

> ioasid_data and iommu_domain have overlapping roles as both are 
> introduced to represent an I/O address space. It is still a big TBD how 
> the two should be corelated or even merged, and whether new iommu 
> ops are required to handle RID+PASID explicitly.

I think it is OK that the uapi and kernel api have different
structs. The uapi focused one should hold the uapi related data, which
is what you've shown here, I think.

> Two helper functions are provided to support VFIO_ATTACH_IOASID:
> 
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;
> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev, 
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

Honestly, I still prefer this to be highly explicit as this is where
all device driver authors get involved:

ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev, u32 ioasid);
ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

And presumably a variant for ARM non-PCI platform (?) devices.

This could boil down to a __ioasid_device_attach() as you've shown.

> A new object is introduced and linked to ioasid_data->attach_data for 
> each successful attach operation:
> 
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}

This should be returned as a pointer and detach should be:

int ioasid_device_detach(struct ioasid_attach_data *);
 
> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is 
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.

It is simple enough. Would be good to design in a diagnostic string so
userspace can make sense of the failure. Eg return something like
-EDEADLK and provide an ioctl 'why did EDEADLK happen'?
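
e.g. something like (purely hypothetical, just to illustrate the idea):

	struct ioasid_last_error {
		__u32	ioasid;
		char	reason[64];	/* e.g. "iommu group only partially attached" */
	};
	ioctl(ioasid_fd, IOASID_GET_LAST_ERROR, &err);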


> Then is the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
> 		u32 ioasid, bool alloc);
> 
> ioasid_get_global_pasid is necessary in scenarios where multiple devices 
> want to share a same PASID value on the attached I/O page table (e.g. 
> when ENQCMD is enabled, as explained in next section). We need a 
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to setup PASID translation 
> structure when user calls KVM_MAP_PASID.

When/why would the VFIO driver do this? Isn't this just some variant
of pasid_attach?

ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

?

> 4. PASID Virtualization
> 
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are 
> created on the assigned vfio device. This leads to the concepts of 
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned 
> by the guest to mark an GVA address space while pPASID is the one 
> selected by the host and actually routed in the wire.
> 
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

Should the vPASID be programmed into the IOASID before calling
VFIO_ATTACH_IOASID?

> vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> device, with two factors to be considered:
> 
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
>      should be instead converted to a newly-allocated one (vPASID!=
>      pPASID);
> 
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>      space or a global PASID space (implying sharing pPASID cross devices,
>      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>      as part of the process context);

This whole section 4 is really confusing. I think it would be more
understandable to focus on the list below and minimize the vPASID
discussion.

> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
> 
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
> policies.)

This has become unclear. I think this should start by identifying the
6 main types of devices and how they can use pPASID/vPASID:

0) Device is a RID and cannot issue PASID
1) Device is a mdev and cannot issue PASID
2) Device is a mdev and programs a single fixed PASID during bind,
   does not accept PASID from the guest

3) Device accepts any PASIDs from the guest. No
   vPASID/pPASID translation is possible. (classic vfio_pci)
4) Device accepts any PASID from the guest and has an
   internal vPASID/pPASID translation (enhanced vfio_pci)
5) Device accepts any PASID from the guest and relies on
   external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)

0-2 don't use vPASID at all

3-5 consume a vPASID but handle it differently.

I think the 3-5 map into what you are trying to explain in the table
below, which is the rules for allocating the vPASID depending on which
of device types 3-5 are present and/or mixed.

For instance device type 3 requires vPASID == pPASID because it can't
do translation at all.

This probably all needs to come through clearly in the /dev/ioasid
interface. Once the attached devices are labeled it would make sense to
have a 'query device' /dev/ioasid IOCTL to report the details based on
how the device was attached and other information.

> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> 
>      PASIDs are also used by kernel to mark the default I/O address space 
>      for mdev, thus cannot be delegated to the guest. Instead, the mdev 
>      driver must allocate a new pPASID for each vPASID (thus vPASID!=
>      pPASID) and then use pPASID when attaching this mdev to an ioasid.

I don't understand this at all. What does "PASIDs are also used by
the kernel" mean?

>      The mdev driver needs to cache the PASID mapping so in the mediation 
>      path vPASID programmed by the guest can be converted to pPASID 
>      before updating the physical MMIO register.

This is my scenario #4 above. The device can internally virtualize
vPASID/pPASID - how that is done is up to the device. But this is all
just labels; when such a device attaches, it should use some specific
API:

ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
 u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

And then maintain its internal translation

>      In previous thread a PASID range split scheme was discussed to support
>      this combination, but we haven't worked out a clean uAPI design yet.
>      Therefore in this proposal we decide to not support it, implying the 
>      user should have some intelligence to avoid such scenario. It could be
>      a TODO task for future.

It really just boils down to how to allocate the PASIDs to get around
the bad viommu interface that assumes all PASIDs are usable by all
devices.
 
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
> 
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;

Regardless, all this mess needs to be hidden from the consuming drivers
with some simple APIs as above. The driver should indicate what its HW
can do and the PASID #'s that magically come out of /dev/ioasid should
be appropriate.

Will resume on another email..

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36 ` Jason Gunthorpe
@ 2021-05-31 11:31   ` Liu Yi L
  2021-05-31 18:09     ` Jason Gunthorpe
  2021-06-01  1:25     ` Lu Baolu
  2021-06-01 11:09   ` Lu Baolu
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 258+ messages in thread
From: Liu Yi L @ 2021-05-31 11:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: yi.l.liu, Tian, Kevin, Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave, Wu, Hao,
	David Woodhouse, Jason Wang

On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:

> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> > 
> > /*
> >   * Check whether an uAPI extension is supported. 
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration. 
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)  
> 
>  
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   *	- vaddr;
> >   *	- size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)  
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
> 
> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID. 
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address 
> >   * space. Each IOASID is associated with a single I/O page table. User 
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is 
> >   * attached to a device. Once attached, an empty I/O page table is 
> >   * bound with the IOMMU then the user could use either DMA mapping 
> >   * or pgtable binding commands to manage this I/O page table.  
> 
> Can the IOASID be populated before being attached?

perhaps a MAP/UNMAP operation on a gpa_ioasid?

> 
> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)  
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
>
> 
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >   *	- vendor pgtable formats (pgtable binding);
> >   *	- number of child IOASIDs (nesting);
> >   *	- ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.  
> 
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

but an IOASID is just a software handle before it is attached to a specific
device. E.g. before attaching to a device, we have no idea about the
supported page sizes of the underlying iommu, coherency, etc.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.

Actually, we have only two kinds of IOASIDs so far. One is used as a parent
and the other as a child. For the child, this proposal defines
IOASID_CREATE_NESTING. But yeah, I think it is doable to indicate the type
in ALLOC. For a child IOASID one more step would be required to configure
its parent IOASID, or such info may be included in the ioctl input as well.
 
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is trying to use.

Yeah, I guess you mean to fail the device attach when the IOASID is a
nesting IOASID but the device is behind an iommu without nesting support.
Right?

> 
> > /*
> >   * Map/unmap process virtual addresses to I/O virtual addresses.
> >   *
> >   * Provide VFIO type1 equivalent semantics. Start with the same 
> >   * restriction e.g. the unmap size should match those used in the 
> >   * original mapping call. 
> >   *
> >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> >   * must be already in the preregistered list.
> >   *
> >   * Input parameters:
> >   *	- u32 ioasid;
> >   *	- refer to vfio_iommu_type1_dma_{un}map
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)  
> 
> What about nested IOASIDs?

At first glance, it looks like we should prevent MAP/UNMAP usage on
nested IOASIDs. At least hardware nested translation only allows MAP/UNMAP
on the parent IOASIDs and page table bind on nested IOASIDs. But considering
software nesting, it still seems useful to allow MAP/UNMAP usage on nested
IOASIDs. This is how I understand it; what is your opinion? Do you think
it's better to allow MAP/UNMAP usage only on parent IOASIDs as a start?

> 
> > /*
> >   * Create a nesting IOASID (child) on an existing IOASID (parent)
> >   *
> >   * IOASIDs can be nested together, implying that the output address 
> >   * from one I/O page table (child) must be further translated by 
> >   * another I/O page table (parent).
> >   *
> >   * As the child adds essentially another reference to the I/O page table 
> >   * represented by the parent, any device attached to the child ioasid 
> >   * must be already attached to the parent.
> >   *
> >   * In concept there is no limit on the number of the nesting levels. 
> >   * However for the majority case one nesting level is sufficient. The
> >   * user should check whether an IOASID supports nesting through 
> >   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> >   * the nesting capability is reported only on the parent instead of the
> >   * child.
> >   *
> >   * User also needs check (via IOASID_GET_INFO) whether the nesting 
> >   * is implemented in hardware or software. If software-based, DMA 
> >   * mapping protocol should be used on the child IOASID. Otherwise, 
> >   * the child should be operated with pgtable binding protocol.
> >   *
> >   * Input parameters:
> >   *	- u32 parent_ioasid;
> >   *
> >   * Return: child_ioasid on success, -errno on failure;
> >   */
> > #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)  
> 
> Do you think another ioctl is best? Should this just be another
> parameter to alloc?

Either is fine. This ioctl follows one of your previous comments.

https://lore.kernel.org/linux-iommu/20210422121020.GT1370958@nvidia.com/

> 
> > /*
> >   * Bind an user-managed I/O page table with the IOMMU
> >   *
> >   * Because user page table is untrusted, IOASID nesting must be enabled 
> >   * for this ioasid so the kernel can enforce its DMA isolation policy 
> >   * through the parent ioasid.
> >   *
> >   * Pgtable binding protocol is different from DMA mapping. The latter 
> >   * has the I/O page table constructed by the kernel and updated 
> >   * according to user MAP/UNMAP commands. With pgtable binding the 
> >   * whole page table is created and updated by userspace, thus different 
> >   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> >   *
> >   * Because the page table is directly walked by the IOMMU, the user 
> >   * must  use a format compatible to the underlying hardware. It can 
> >   * check the format information through IOASID_GET_INFO.
> >   *
> >   * The page table is bound to the IOMMU according to the routing 
> >   * information of each attached device under the specified IOASID. The
> >   * routing information (RID and optional PASID) is registered when a 
> >   * device is attached to this IOASID through VFIO uAPI. 
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> >   *	- address of the user page table;
> >   *	- formats (vendor, address_width, etc.);
> >   * 
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)  
> 
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?

Here the model is that user-space gets the page table format from the kernel
and decides if it can proceed. So what you are suggesting is that user-space
should tell the kernel the page table format it has in ALLOC and the kernel
should fail the ALLOC if the user-space page table format is not compatible
with the underlying iommu?

> 
> > /*
> >   * Bind an user-managed PASID table to the IOMMU
> >   *
> >   * This is required for platforms which place PASID table in the GPA space.
> >   * In this case the specified IOASID represents the per-RID PASID space.
> >   *
> >   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> >   * special flag to indicate the difference from normal I/O address spaces.
> >   *
> >   * The format info of the PASID table is reported in IOASID_GET_INFO.
> >   *
> >   * As explained in the design section, user-managed I/O page tables must
> >   * be explicitly bound to the kernel even on these platforms. It allows
> >   * the kernel to uniformly manage I/O address spaces cross all platforms.
> >   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> >   * to carry device routing information to indirectly mark the hidden I/O
> >   * address spaces.
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> >   *	- address of PASID table;
> >   *	- formats (vendor, size, etc.);
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> > #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)  
> 
> Ditto
> 
> > 
> > /*
> >   * Invalidate IOTLB for an user-managed I/O page table
> >   *
> >   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
> >   * doesn't allow the user to specify cache type and likely support only
> >   * two granularities (all, or a specified range) in the I/O address space.
> >   *
> >   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> >   * cache). If the IOASID represents an I/O address space, the invalidation
> >   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> >   * represents a vPASID space, then this command applies to the PASID
> >   * cache.
> >   *
> >   * Similarly this command doesn't provide IOMMU-like granularity
> >   * info (domain-wide, pasid-wide, range-based), since it's all about the
> >   * I/O address space itself. The ioasid driver walks the attached
> >   * routing information to match the IOMMU semantics under the
> >   * hood. 
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> >   *	- granularity
> >   * 
> >   * Return: 0 on success, -errno on failure
> >   */
> > #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)  
> 
> This should have an IOVA range too?
> 
> > /*
> >   * Page fault report and response
> >   *
> >   * This is TBD. Can be added after other parts are cleared up. Likely it 
> >   * will be a ring buffer shared between user/kernel, an eventfd to notify 
> >   * the user and an ioctl to complete the fault.
> >   *
> >   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> >   */  
> 
> Any reason not to just use read()?

a ring buffer may be mmap'ed to user space, so reading fault data from the
kernel would be faster. This is also how Eric's fault reporting works today.

https://lore.kernel.org/linux-iommu/20210411114659.15051-5-eric.auger@redhat.com/
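
To illustrate the rough idea (everything below is only a sketch; the ring
layout, struct and ioctl names are made up and not part of this proposal
yet):

	/* hypothetical layout of the mmap'ed fault region */
	struct ioasid_fault {
		__u32	ioasid;
		__u32	flags;
		__u64	fault_addr;
	};

	struct ioasid_fault_ring {
		__u32	head;			/* producer: kernel */
		__u32	tail;			/* consumer: user   */
		struct ioasid_fault	faults[];
	};

	/* user side: wait on the eventfd, then drain the shared ring */
	ring = mmap(NULL, RING_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED, ioasid_fd, 0);
	while (read(event_fd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
		while (ring->tail != ring->head) {
			handle_fault(&ring->faults[ring->tail % NR_ENTRIES]);
			ring->tail++;
		}
		/* completion would go through a separate ioctl, e.g.
		 * ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);
		 */
	}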

> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++  
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
> 
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *  - ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)  
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

perhaps this is the device info Jean-Philippe wants in the page fault
reporting path?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (5 preceding siblings ...)
  2021-05-28 23:36 ` Jason Gunthorpe
@ 2021-05-31 17:37 ` Parav Pandit
  2021-05-31 18:12   ` Jason Gunthorpe
  2021-06-02  8:38   ` Enrico Weigelt, metux IT consult
  2021-06-01  4:31 ` Shenming Lu
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 258+ messages in thread
From: Parav Pandit @ 2021-05-31 17:37 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy



> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, May 27, 2021 1:28 PM
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-
> iommu/20210330132830.GO2356281@nvidia.com/
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the detailed RFC. Digesting it...

[..]
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */	
It appears that this is only to make the map ioctl faster, apart from the accounting.
It doesn't have any ioasid handle input either.

In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
For example, a few years back such a system call, mpin(), was proposed in [1].

Or would a new MAP_PINNED flag be a better approach, achieving this in a single mmap() call?

> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE,
> IOASID_BASE + 2)

[1] https://lwn.net/Articles/600502/
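
Just to make the comparison concrete, a rough sketch of the two flavours
from userspace (MAP_PINNED does not exist today, and the struct name below
is only illustrative):

	/* proposed ioctl flavour: pre-register, later map into IOASIDs */
	struct ioasid_register_mem reg = {
		.vaddr	= (__u64)(uintptr_t)buf,
		.size	= buf_size,
	};
	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &reg);

	/* hypothetical mmap flavour: pin at allocation time */
	buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_PINNED, -1, 0);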

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 11:31   ` Liu Yi L
@ 2021-05-31 18:09     ` Jason Gunthorpe
  2021-06-01  3:08       ` Lu Baolu
  2021-06-01  1:25     ` Lu Baolu
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-31 18:09 UTC (permalink / raw)
  To: Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave, Wu, Hao,
	David Woodhouse, Jason Wang

On Mon, May 31, 2021 at 07:31:57PM +0800, Liu Yi L wrote:
> > > /*
> > >   * Get information about an I/O address space
> > >   *
> > >   * Supported capabilities:
> > >   *	- VFIO type1 map/unmap;
> > >   *	- pgtable/pasid_table binding
> > >   *	- hardware nesting vs. software nesting;
> > >   *	- ...
> > >   *
> > >   * Related attributes:
> > >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> > >   *	- vendor pgtable formats (pgtable binding);
> > >   *	- number of child IOASIDs (nesting);
> > >   *	- ...
> > >   *
> > >   * Above information is available only after one or more devices are
> > >   * attached to the specified IOASID. Otherwise the IOASID is just a
> > >   * number w/o any capability or attribute.  
> > 
> > This feels wrong to learn most of these attributes of the IOASID after
> > attaching to a device.
> 
> but an IOASID is just a software handle before attached to a specific
> device. e.g. before attaching to a device, we have no idea about the
> supported page size in underlying iommu, coherent etc.

The idea is you attach the device to the /dev/ioasid FD and this
action is what crystallizes the iommu driver that is being used:

        device_fd = open("/dev/vfio/devices/dev1", mode);
        ioasid_fd = open("/dev/ioasid", mode);
        ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

After this sequence we should have most of the information about the
IOMMU.

One /dev/ioasid FD has one iommu driver. Design what an "iommu driver"
means so that the system should only have one. Eg the coherent/not
coherent distinction should not be a different "iommu driver".

Device attach to the _IOASID_ is a different thing, and I think it
puts the whole sequence out of order because we lose the option to
customize the IOASID before it has to be realized into HW format.

> > The user should have some idea how it intends to use the IOASID when
> > it creates it and the rest of the system should match the intention.
> > 
> > For instance if the user is creating a IOASID to cover the guest GPA
> > with the intention of making children it should indicate this during
> > alloc.
> > 
> > If the user is intending to point a child IOASID to a guest page table
> > in a certain descriptor format then it should indicate it during
> > alloc.
> 
> Actually, we have only two kinds of IOASIDs so far. 

Maybe at a very, very high level, but it looks like there is a lot of
IOMMU-specific configuration that goes into an IOASID.


> > device bind should fail if the device somehow isn't compatible with
> > the scheme the user is tring to use.
> 
> yeah, I guess you mean to fail the device attach when the IOASID is a
> nesting IOASID but the device is behind an iommu without nesting support.
> right?

Right..
 
> > 
> > > /*
> > >   * Map/unmap process virtual addresses to I/O virtual addresses.
> > >   *
> > >   * Provide VFIO type1 equivalent semantics. Start with the same 
> > >   * restriction e.g. the unmap size should match those used in the 
> > >   * original mapping call. 
> > >   *
> > >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >   * must be already in the preregistered list.
> > >   *
> > >   * Input parameters:
> > >   *	- u32 ioasid;
> > >   *	- refer to vfio_iommu_type1_dma_{un}map
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)  
> > 
> > What about nested IOASIDs?
> 
> at first glance, it looks like we should prevent the MAP/UNMAP usage on
> nested IOASIDs. At least hardware nested translation only allows MAP/UNMAP
> on the parent IOASIDs and page table bind on nested IOASIDs. But considering
> about software nesting, it seems still useful to allow MAP/UNMAP usage
> on nested IOASIDs. This is how I understand it, how about your opinion
> on it? do you think it's better to allow MAP/UNMAP usage only on parent
> IOASIDs as a start?

If the only form of nested IOASID is the "read the page table from
my process memory" then MAP/UNMAP won't make sense on that..

MAP/UNMAP is only useful if the page table is stored in kernel memory.

> > > #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)  
> > 
> > Do you think another ioctl is best? Should this just be another
> > parameter to alloc?
> 
> either is fine. This ioctl is following one of your previous comment.

Sometimes I say things in a way that is meant to make concepts easier
to understand, not necessarily good API design :)

> > > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)  
> > 
> > Also feels backwards, why wouldn't we specify this, and the required
> > page table format, during alloc time?
> 
> here the model is user-space gets the page table format from kernel and
> decide if it can proceed. So what you are suggesting is user-space should
> tell kernel the page table format it has in ALLOC and kenrel should fail
> the ALLOC if the user-space page table format is not compatible with underlying
> iommu?

Yes, the action should be
   Alloc an IOASID that points at a page table in this user memory,
   that is stored in this specific format.

The supported formats should be discoverable after VFIO_BIND_IOASID_FD
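
Roughly (struct/field names below are only illustrative, and GET_INFO here
is shown working at the FD level rather than per-IOASID; nothing is settled):

	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* discover which page table formats this IOMMU can walk */
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	alloc = (struct ioasid_alloc) {
		.type		= IOASID_TYPE_USER_PGTABLE,
		.parent		= gpa_ioasid,
		.pgtable_uptr	= (__u64)(uintptr_t)guest_pgtable,
		.format		= info.pgtable_formats[0],
		.addr_width	= 48,
	};
	child_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);
	/* fails right here if the format can't be handled by the IOMMU */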

> > > /*
> > >   * Page fault report and response
> > >   *
> > >   * This is TBD. Can be added after other parts are cleared up. Likely it 
> > >   * will be a ring buffer shared between user/kernel, an eventfd to notify 
> > >   * the user and an ioctl to complete the fault.
> > >   *
> > >   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> > >   */  
> > 
> > Any reason not to just use read()?
> 
> a ring buffer may be mmap to user-space, thus reading fault data from kernel
> would be faster. This is also how Eric's fault reporting is doing today.

Okay, if it is performance sensitive.. mmap rings are just tricky beasts

> > >    * Bind a vfio_device to the specified IOASID fd
> > >    *
> > >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > >    * vfio device should not be bound to multiple ioasid_fd's.
> > >    *
> > >    * Input parameters:
> > >    *  - ioasid_fd;
> > >    *
> > >    * Return: 0 on success, -errno on failure.
> > >    */
> > > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)  
> > 
> > This is where it would make sense to have an output "device id" that
> > allows /dev/ioasid to refer to this "device" by number in events and
> > other related things.
> 
> perhaps this is the device info Jean Philippe wants in page fault reporting
> path?

Yes, it is

Jason
 

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 17:37 ` Parav Pandit
@ 2021-05-31 18:12   ` Jason Gunthorpe
  2021-06-01 12:04     ` Parav Pandit
  2021-06-02  8:38   ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-05-31 18:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:

> In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
> For example few years back such system call mpin() thought was proposed in [1].

Reference counting of the overall pins is required

So when a pinned page is incorporated into an IOASID page table in a
later IOCTL, it means it cannot be unpinned while the IOASID page table
is using it.

This is some trick to organize the pinning into groups and then
refcount each group, thus avoiding needing per-page refcounts.

The data structure would be an interval tree of pins in general

The ioasid itself would have an interval tree of its own mappings,
each entry in this tree would reference count against an element in
the above tree

Then the ioasid's interval tree would be mapped into a page table tree
in HW format.

The redundant storage is needed to keep track of the referencing and
the CPU page table values for later unpinning.
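
Sketching the relationship (not real code, just to show the two trees and
where the refcount lives):

	/* one node per pre-registered range, covering [vaddr, vaddr+size) */
	struct pin_range {
		struct interval_tree_node	node;
		struct page			**pages;	/* pinned CPU pages */
		refcount_t			users;		/* IOASID mappings using it */
	};

	/* one node per IOASID mapping, covering [iova, iova+size) */
	struct ioasid_mapping {
		struct interval_tree_node	node;
		struct pin_range		*pin;	/* holds a ref on pin->users */
		unsigned long			vaddr;	/* kept for later unpin */
	};

Each ioasid then mirrors its mapping tree into the HW-format I/O page
table, and unmap/unpin walks these trees to drop the references.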

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 11:31   ` Liu Yi L
  2021-05-31 18:09     ` Jason Gunthorpe
@ 2021-06-01  1:25     ` Lu Baolu
  1 sibling, 0 replies; 258+ messages in thread
From: Lu Baolu @ 2021-06-01  1:25 UTC (permalink / raw)
  To: Liu Yi L, Jason Gunthorpe
  Cc: baolu.lu, Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj,
	Ashok, kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy, David Gibson

On 5/31/21 7:31 PM, Liu Yi L wrote:
> On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:
> 
>> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>>
>>> 2.1. /dev/ioasid uAPI
>>> +++++++++++++++++

[---cut for short---]

>>> /*
>>>    * Allocate an IOASID.
>>>    *
>>>    * IOASID is the FD-local software handle representing an I/O address
>>>    * space. Each IOASID is associated with a single I/O page table. User
>>>    * must call this ioctl to get an IOASID for every I/O address space that is
>>>    * intended to be enabled in the IOMMU.
>>>    *
>>>    * A newly-created IOASID doesn't accept any command before it is
>>>    * attached to a device. Once attached, an empty I/O page table is
>>>    * bound with the IOMMU then the user could use either DMA mapping
>>>    * or pgtable binding commands to manage this I/O page table.
>> Can the IOASID can be populated before being attached?
> perhaps a MAP/UNMAP operation on a gpa_ioasid?
> 

But before attaching to any device, there's no connection between an
IOASID and the underlying IOMMU. How do you know the supported page
sizes and cache coherency?

The iommu_group restriction is implicitly expressed as: only after all
devices belonging to an iommu_group are attached can the page table
operations be performed.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
       [not found]   ` <20210531164118.265789ee@yiliu-dev>
@ 2021-06-01  2:36     ` Jason Wang
  2021-06-01  4:27       ` Shenming Lu
       [not found]       ` <20210601113152.6d09e47b@yiliu-dev>
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Wang @ 2021-06-01  2:36 UTC (permalink / raw)
  To: Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet


On 5/31/21 4:41 PM, Liu Yi L wrote:
>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>> hardware nesting. Or is there way to detect the capability before?
> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
> is not able to support nesting, then should fail it.
>
>> I think GET_INFO only works after the ATTACH.
> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
> gpa_ioasid and check if nesting is supported or not. right?


Some more questions:

1) Is the handle returned by IOASID_ALLOC an fd?
2) If yes, what's the reason for not simply using the fd opened from
/dev/ioas? (This is the question that was not answered.) And what happens
if we call GET_INFO on the ioasid_fd?
3) If not, how does GET_INFO work?


>
>>> 	/* Bind guest I/O page table  */
>>> 	bind_data = {
>>> 		.ioasid	= giova_ioasid;
>>> 		.addr	= giova_pgtable;
>>> 		// and format information
>>> 	};
>>> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>>>
>>> 	/* Invalidate IOTLB when required */
>>> 	inv_data = {
>>> 		.ioasid	= giova_ioasid;
>>> 		// granular information
>>> 	};
>>> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>>>
>>> 	/* See 5.6 for I/O page fault handling */
>>> 	
>>> 5.5. Guest SVA (vSVA)
>>> ++++++++++++++++++
>>>
>>> After boots the guest further create a GVA address spaces (gpasid1) on
>>> dev1. Dev2 is not affected (still attached to giova_ioasid).
>>>
>>> As explained in section 4, user should avoid expose ENQCMD on both
>>> pdev and mdev.
>>>
>>> The sequence applies to all device types (being pdev or mdev), except
>>> one additional step to call KVM for ENQCMD-capable mdev:
>> My understanding is ENQCMD is Intel specific and not a requirement for
>> having vSVA.
> ENQCMD is not really Intel specific although only Intel supports it today.
> The PCIe DMWr capability is the capability for software to enumerate the
> ENQCMD support in device side. yes, it is not a requirement for vSVA. They
> are orthogonal.


Right, then it's better to mention DMWr instead of a vendor specific 
instruction in a general framework like ioasid.

Thanks


>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 18:09     ` Jason Gunthorpe
@ 2021-06-01  3:08       ` Lu Baolu
  2021-06-01 17:24         ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Lu Baolu @ 2021-06-01  3:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu Yi L
  Cc: baolu.lu, Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj,
	Ashok, kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy, David Gibson

On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
>>> device bind should fail if the device somehow isn't compatible with
>>> the scheme the user is tring to use.
>> yeah, I guess you mean to fail the device attach when the IOASID is a
>> nesting IOASID but the device is behind an iommu without nesting support.
>> right?
> Right..
>   

Just want to confirm...

Does this mean that we only support hardware nesting and don't want to
have soft nesting (shadowed page table in kernel) in IOASID?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  2:36     ` Jason Wang
@ 2021-06-01  4:27       ` Shenming Lu
  2021-06-01  5:10         ` Jason Wang
       [not found]       ` <20210601113152.6d09e47b@yiliu-dev>
  1 sibling, 1 reply; 258+ messages in thread
From: Shenming Lu @ 2021-06-01  4:27 UTC (permalink / raw)
  To: Jason Wang, Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Zenghui Yu, wanghaibin.wang

On 2021/6/1 10:36, Jason Wang wrote:
> 
> On 5/31/21 4:41 PM, Liu Yi L wrote:
>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>> hardware nesting. Or is there way to detect the capability before?
>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>> is not able to support nesting, then should fail it.
>>
>>> I think GET_INFO only works after the ATTACH.
>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>> gpa_ioasid and check if nesting is supported or not. right?
> 
> 
> Some more questions:
> 
> 1) Is the handle returned by IOASID_ALLOC an fd?
> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd?
> 3) If not, how GET_INFO work?

It seems that the return value from IOASID_ALLOC is an IOASID number in the
ioasid_data struct. Then, when calling GET_INFO, we should convey this IOASID
number to get the associated I/O address space attributes (which depend on the
physical IOMMU, which could be discovered when attaching a device to the
IOASID fd or number), right?

Thanks,
Shenming

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (6 preceding siblings ...)
  2021-05-31 17:37 ` Parav Pandit
@ 2021-06-01  4:31 ` Shenming Lu
  2021-06-01  5:10   ` Lu Baolu
  2021-06-01 17:30 ` Parav Pandit
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 258+ messages in thread
From: Shenming Lu @ 2021-06-01  4:31 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/5/27 15:58, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
> 

[..]

> 
> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */

Hi,

It seems that the ioasid has different usages in different situations: it could
be directly used in the physical routing, or be just a virtual handle that indicates
a page table or a vPASID table (such as the GPA address space; in the simple
passthrough case, the DMA input to the IOMMU will just contain a Stream ID, no
Substream ID), right?

And Baolu suggested that since one device might consume multiple page tables,
it's more reasonable to have one fault handler per page table. By this, do we
have to maintain such an ioasid info list in the IOMMU layer?

Then if we add host IOPF support (for the GPA address space) in the future
(I have sent a series for this but it was aimed at VFIO; I will convert it for
IOASID later [1] :-)), how could we find the handler for the received fault
event which only contains a Stream ID... Do we also have to maintain a
dev(vPASID)->ioasid mapping in the IOMMU layer?

[1] https://lore.kernel.org/patchwork/cover/1410223/

Thanks,
Shenming

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
       [not found]       ` <20210601113152.6d09e47b@yiliu-dev>
@ 2021-06-01  5:08         ` Jason Wang
  2021-06-01  5:23           ` Lu Baolu
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-01  5:08 UTC (permalink / raw)
  To: Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet


On 6/1/21 11:31 AM, Liu Yi L wrote:
> On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote:
>
>> On 5/31/21 4:41 PM, Liu Yi L wrote:
>>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>>> hardware nesting. Or is there way to detect the capability before?
>>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>>> is not able to support nesting, then should fail it.
>>>   
>>>> I think GET_INFO only works after the ATTACH.
>>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>>> gpa_ioasid and check if nesting is supported or not. right?
>>
>> Some more questions:
>>
>> 1) Is the handle returned by IOASID_ALLOC an fd?
> it's an ID so far in this proposal.


Ok.


>
>> 2) If yes, what's the reason for not simply use the fd opened from
>> /dev/ioas. (This is the question that is not answered) and what happens
>> if we call GET_INFO for the ioasid_fd?
>> 3) If not, how GET_INFO work?
> oh, missed this question in prior reply. Personally, no special reason
> yet. But using ID may give us opportunity to customize the management
> of the handle. For one, better lookup efficiency by using xarray to
> store the allocated IDs. For two, could categorize the allocated IDs
> (parent or nested). GET_INFO just works with an input FD and an ID.


I'm not sure I get this; for nesting cases you can still make the child
an fd.

And still a question: in what case do we need to create multiple ioasids
on a single ioasid fd?

(This case is not demonstrated in your examples).

Thanks


>
>>>   
>>>>> 	/* Bind guest I/O page table  */
>>>>> 	bind_data = {
>>>>> 		.ioasid	= giova_ioasid;
>>>>> 		.addr	= giova_pgtable;
>>>>> 		// and format information
>>>>> 	};
>>>>> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>>>>>
>>>>> 	/* Invalidate IOTLB when required */
>>>>> 	inv_data = {
>>>>> 		.ioasid	= giova_ioasid;
>>>>> 		// granular information
>>>>> 	};
>>>>> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>>>>>
>>>>> 	/* See 5.6 for I/O page fault handling */
>>>>> 	
>>>>> 5.5. Guest SVA (vSVA)
>>>>> ++++++++++++++++++
>>>>>
>>>>> After boots the guest further create a GVA address spaces (gpasid1) on
>>>>> dev1. Dev2 is not affected (still attached to giova_ioasid).
>>>>>
>>>>> As explained in section 4, user should avoid expose ENQCMD on both
>>>>> pdev and mdev.
>>>>>
>>>>> The sequence applies to all device types (being pdev or mdev), except
>>>>> one additional step to call KVM for ENQCMD-capable mdev:
>>>> My understanding is ENQCMD is Intel specific and not a requirement for
>>>> having vSVA.
>>> ENQCMD is not really Intel specific although only Intel supports it today.
>>> The PCIe DMWr capability is the capability for software to enumerate the
>>> ENQCMD support in device side. yes, it is not a requirement for vSVA. They
>>> are orthogonal.
>>
>> Right, then it's better to mention DMWr instead of a vendor specific
>> instruction in a general framework like ioasid.
> good suggestion. :)
>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  4:27       ` Shenming Lu
@ 2021-06-01  5:10         ` Jason Wang
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Wang @ 2021-06-01  5:10 UTC (permalink / raw)
  To: Shenming Lu, Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Zenghui Yu, wanghaibin.wang


On 6/1/21 12:27 PM, Shenming Lu wrote:
> On 2021/6/1 10:36, Jason Wang wrote:
>> On 5/31/21 4:41 PM, Liu Yi L wrote:
>>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>>> hardware nesting. Or is there way to detect the capability before?
>>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>>> is not able to support nesting, then should fail it.
>>>
>>>> I think GET_INFO only works after the ATTACH.
>>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>>> gpa_ioasid and check if nesting is supported or not. right?
>>
>> Some more questions:
>>
>> 1) Is the handle returned by IOASID_ALLOC an fd?
>> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd?
>> 3) If not, how GET_INFO work?
> It seems that the return value from IOASID_ALLOC is an IOASID number in the
> ioasid_data struct, then when calling GET_INFO, we should convey this IOASID
> number to get the associated I/O address space attributes (depend on the
> physical IOMMU, which could be discovered when attaching a device to the
> IOASID fd or number), right?


Right, but the question is why we need such indirection, unless there's a
case where you need to create multiple IOASIDs per ioasid fd. It's simpler
to attach the metadata to the ioasid fd itself.

Thanks


>
> Thanks,
> Shenming
>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  4:31 ` Shenming Lu
@ 2021-06-01  5:10   ` Lu Baolu
  2021-06-01  7:15     ` Shenming Lu
  0 siblings, 1 reply; 258+ messages in thread
From: Lu Baolu @ 2021-06-01  5:10 UTC (permalink / raw)
  To: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: baolu.lu, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu,
	Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

Hi Shenming,

On 6/1/21 12:31 PM, Shenming Lu wrote:
> On 2021/5/27 15:58, Tian, Kevin wrote:
>> /dev/ioasid provides an unified interface for managing I/O page tables for
>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>> etc.) are expected to use this interface instead of creating their own logic to
>> isolate untrusted device DMAs initiated by userspace.
>>
>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>> with VFIO as example in typical usages. The driver-facing kernel API provided
>> by the iommu layer is still TBD, which can be discussed after consensus is
>> made on this uAPI.
>>
>> It's based on a lengthy discussion starting from here:
>> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>
>> It ends up to be a long writing due to many things to be summarized and
>> non-trivial effort required to connect them into a complete proposal.
>> Hope it provides a clean base to converge.
>>
> 
> [..]
> 
>>
>> /*
>>    * Page fault report and response
>>    *
>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>    * the user and an ioctl to complete the fault.
>>    *
>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>    */
> 
> Hi,
> 
> It seems that the ioasid has different usage in different situation, it could
> be directly used in the physical routing, or just a virtual handle that indicates
> a page table or a vPASID table (such as the GPA address space, in the simple
> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
> Substream ID), right?
> 
> And Baolu suggested that since one device might consume multiple page tables,
> it's more reasonable to have one fault handler per page table. By this, do we
> have to maintain such an ioasid info list in the IOMMU layer?

As discussed earlier, the I/O page fault and cache invalidation paths
will have "device labels" so that the information could be easily
translated and routed.

So it's likely the per-device fault handler registration API in the iommu
core can be kept, but /dev/ioasid will grow a layer to translate and
propagate I/O page fault information to the right consumers.

If things evolve in this way, probably the SVA I/O page fault also needs
to be ported to /dev/ioasid.
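
Something like below is what I have in mind for that layer (all names are
made up just to show the routing, none of this is an existing API):

	/* called from the per-device fault handler registered in iommu core */
	static void ioasid_fd_route_fault(struct ioasid_fd *ifd,
					  struct device *dev,
					  u32 ppasid, u64 addr)
	{
		u32 label  = ioasid_fd_dev_label(ifd, dev);	/* set at bind time */
		u32 ioasid = ioasid_fd_find_ioasid(ifd, label, ppasid);

		ioasid_fd_queue_fault(ifd, ioasid, label, addr); /* ring + eventfd */
	}

The pPASID->vPASID conversion for the report to userspace would also
happen in this layer.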

> 
> Then if we add host IOPF support (for the GPA address space) in the future
> (I have sent a series for this but it aimed for VFIO, I will convert it for
> IOASID later [1] :-)), how could we find the handler for the received fault
> event which only contains a Stream ID... Do we also have to maintain a
> dev(vPASID)->ioasid mapping in the IOMMU layer?
> 
> [1] https://lore.kernel.org/patchwork/cover/1410223/

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:08         ` Jason Wang
@ 2021-06-01  5:23           ` Lu Baolu
  2021-06-01  5:29             ` Jason Wang
  0 siblings, 1 reply; 258+ messages in thread
From: Lu Baolu @ 2021-06-01  5:23 UTC (permalink / raw)
  To: Jason Wang, Liu Yi L
  Cc: baolu.lu, yi.l.liu, Tian, Kevin, LKML, Joerg Roedel,
	Jason Gunthorpe, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:
>>> 2) If yes, what's the reason for not simply use the fd opened from
>>> /dev/ioas. (This is the question that is not answered) and what happens
>>> if we call GET_INFO for the ioasid_fd?
>>> 3) If not, how GET_INFO work?
>> oh, missed this question in prior reply. Personally, no special reason
>> yet. But using ID may give us opportunity to customize the management
>> of the handle. For one, better lookup efficiency by using xarray to
>> store the allocated IDs. For two, could categorize the allocated IDs
>> (parent or nested). GET_INFO just works with an input FD and an ID.
> 
> 
> I'm not sure I get this, for nesting cases you can still make the child 
> an fd.
> 
> And a question still, under what case we need to create multiple ioasids 
> on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple
IOASID FDs" for such a case.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:23           ` Lu Baolu
@ 2021-06-01  5:29             ` Jason Wang
  2021-06-01  5:42               ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-01  5:29 UTC (permalink / raw)
  To: Lu Baolu, Liu Yi L
  Cc: yi.l.liu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet


On 6/1/21 1:23 PM, Lu Baolu wrote:
> Hi Jason W,
>
> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>> /dev/ioas. (This is the question that is not answered) and what 
>>>> happens
>>>> if we call GET_INFO for the ioasid_fd?
>>>> 3) If not, how GET_INFO work?
>>> oh, missed this question in prior reply. Personally, no special reason
>>> yet. But using ID may give us opportunity to customize the management
>>> of the handle. For one, better lookup efficiency by using xarray to
>>> store the allocated IDs. For two, could categorize the allocated IDs
>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>
>>
>> I'm not sure I get this, for nesting cases you can still make the 
>> child an fd.
>>
>> And a question still, under what case we need to create multiple 
>> ioasids on a single ioasid fd?
>
> One possible situation where multiple IOASIDs per FD could be used is
> that devices with different underlying IOMMU capabilities are sharing a
> single FD. In this case, only devices with consistent underlying IOMMU
> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> be applied.
>
> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> IOASID FDs" for such case.


Right, that's exactly my question. The latter seems much easier to
understand and implement.

Thanks


>
> Best regards,
> baolu
>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:29             ` Jason Wang
@ 2021-06-01  5:42               ` Tian, Kevin
  2021-06-01  6:07                 ` Jason Wang
  0 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-01  5:42 UTC (permalink / raw)
  To: Jason Wang, Lu Baolu, Liu Yi L
  Cc: Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, Jason Gunthorpe,
	David Woodhouse

> From: Jason Wang
> Sent: Tuesday, June 1, 2021 1:30 PM
> 
> On 6/1/21 1:23 PM, Lu Baolu wrote:
> > Hi Jason W,
> >
> > On 6/1/21 1:08 PM, Jason Wang wrote:
> >>>> 2) If yes, what's the reason for not simply use the fd opened from
> >>>> /dev/ioas. (This is the question that is not answered) and what
> >>>> happens
> >>>> if we call GET_INFO for the ioasid_fd?
> >>>> 3) If not, how GET_INFO work?
> >>> oh, missed this question in prior reply. Personally, no special reason
> >>> yet. But using ID may give us opportunity to customize the management
> >>> of the handle. For one, better lookup efficiency by using xarray to
> >>> store the allocated IDs. For two, could categorize the allocated IDs
> >>> (parent or nested). GET_INFO just works with an input FD and an ID.
> >>
> >>
> >> I'm not sure I get this, for nesting cases you can still make the
> >> child an fd.
> >>
> >> And a question still, under what case we need to create multiple
> >> ioasids on a single ioasid fd?
> >
> > One possible situation where multiple IOASIDs per FD could be used is
> > that devices with different underlying IOMMU capabilities are sharing a
> > single FD. In this case, only devices with consistent underlying IOMMU
> > capabilities could be put in an IOASID and multiple IOASIDs per FD could
> > be applied.
> >
> > Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> > IOASID FDs" for such case.
> 
> 
> Right, that's exactly my question. The latter seems much more easier to
> be understood and implemented.
> 

A simple reason discussed in the previous thread - there could be on the
order of 1M I/O address spaces per device, while fd's are a precious
resource. So this RFC treats the fd as a container of address spaces,
each tagged by an IOASID.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:42               ` Tian, Kevin
@ 2021-06-01  6:07                 ` Jason Wang
  2021-06-01  6:16                   ` Tian, Kevin
  2021-06-01 17:29                   ` Jason Gunthorpe
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Wang @ 2021-06-01  6:07 UTC (permalink / raw)
  To: Tian, Kevin, Lu Baolu, Liu Yi L
  Cc: Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, Jason Gunthorpe,
	David Woodhouse


On 6/1/21 1:42 PM, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 1:30 PM
>>
>> On 6/1/21 1:23 PM, Lu Baolu wrote:
>>> Hi Jason W,
>>>
>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>> happens
>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>> 3) If not, how GET_INFO work?
>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>> yet. But using ID may give us opportunity to customize the management
>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>
>>>> I'm not sure I get this, for nesting cases you can still make the
>>>> child an fd.
>>>>
>>>> And a question still, under what case we need to create multiple
>>>> ioasids on a single ioasid fd?
>>> One possible situation where multiple IOASIDs per FD could be used is
>>> that devices with different underlying IOMMU capabilities are sharing a
>>> single FD. In this case, only devices with consistent underlying IOMMU
>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>> be applied.
>>>
>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>> IOASID FDs" for such case.
>>
>> Right, that's exactly my question. The latter seems much more easier to
>> be understood and implemented.
>>
> A simple reason discussed in previous thread - there could be 1M's
> I/O address spaces per device while #FD's are precious resource.


Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantics, you can do a lot of other stuff: close-on-exec,
passing via SCM_RIGHTS.

For the case of 1M, I would like to know what's the use case for a 
single process to handle 1M+ address spaces?


> So this RFC treats fd as a container of address spaces which is each
> tagged by an IOASID.


If the container and address space is 1:1 then the container seems useless.

Thanks


>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  6:07                 ` Jason Wang
@ 2021-06-01  6:16                   ` Tian, Kevin
  2021-06-01  8:47                     ` Jason Wang
  2021-06-01 17:29                   ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-01  6:16 UTC (permalink / raw)
  To: Jason Wang, Lu Baolu, Liu Yi L
  Cc: kvm, Jonathan Corbet, iommu, LKML,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Gunthorpe, David Woodhouse

> From: Jason Wang
> Sent: Tuesday, June 1, 2021 2:07 PM
> 
On 6/1/21 1:42 PM, Tian, Kevin wrote:
> >> From: Jason Wang
> >> Sent: Tuesday, June 1, 2021 1:30 PM
> >>
> >> On 6/1/21 1:23 PM, Lu Baolu wrote:
> >>> Hi Jason W,
> >>>
> >>> On 6/1/21 1:08 PM, Jason Wang wrote:
> >>>>>> 2) If yes, what's the reason for not simply use the fd opened from
> >>>>>> /dev/ioas. (This is the question that is not answered) and what
> >>>>>> happens
> >>>>>> if we call GET_INFO for the ioasid_fd?
> >>>>>> 3) If not, how GET_INFO work?
> >>>>> oh, missed this question in prior reply. Personally, no special reason
> >>>>> yet. But using ID may give us opportunity to customize the
> management
> >>>>> of the handle. For one, better lookup efficiency by using xarray to
> >>>>> store the allocated IDs. For two, could categorize the allocated IDs
> >>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
> >>>>
> >>>> I'm not sure I get this, for nesting cases you can still make the
> >>>> child an fd.
> >>>>
> >>>> And a question still, under what case we need to create multiple
> >>>> ioasids on a single ioasid fd?
> >>> One possible situation where multiple IOASIDs per FD could be used is
> >>> that devices with different underlying IOMMU capabilities are sharing a
> >>> single FD. In this case, only devices with consistent underlying IOMMU
> >>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> >>> be applied.
> >>>
> >>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> >>> IOASID FDs" for such case.
> >>
> >> Right, that's exactly my question. The latter seems much more easier to
> >> be understood and implemented.
> >>
> > A simple reason discussed in previous thread - there could be 1M's
> > I/O address spaces per device while #FD's are precious resource.
> 
> 
> Is the concern for ulimit or performance? Note that we had
> 
> #define NR_OPEN_MAX ~0U
> 
> And with the fd semantic, you can do a lot of other stuffs: close on
> exec, passing via SCM_RIGHTS.

yes, fd has its merits.

> 
> For the case of 1M, I would like to know what's the use case for a
> single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest
there could be many guest processes. Though in reality I haven't seen
1M processes on a single device, better not to restrict it in the uAPI?

> 
> 
> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
> 
> 
> If the container and address space is 1:1 then the container seems useless.
> 

yes, if it's 1:1 then the container is useless. But here it's assumed to be
1:M, so even a single fd is sufficient for all intended usages.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 20:03 ` Jason Gunthorpe
@ 2021-06-01  7:01   ` Tian, Kevin
  2021-06-01 20:28     ` Jason Gunthorpe
  2021-06-01 22:22     ` Alex Williamson
  0 siblings, 2 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-01  7:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 4:03 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > /dev/ioasid provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.
> 
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
> 
> When you get down to the operations they all seem pretty common sense
> and straightfoward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
> 
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.
> 

Agree. btw I feel it might be good to have several design opens
discussed centrally after going through all the comments. Otherwise
they may be buried in different sub-threads and potentially receive
insufficient attention (especially from people who haven't finished
reading).

I summarized five opens here, about:

1)  Finalizing the name to replace /dev/ioasid;
2)  Whether one device is allowed to bind to multiple IOASID fd's;
3)  Carry device information in invalidation/fault reporting uAPI;
4)  What should/could be specified when allocating an IOASID;
5)  The protocol between vfio group and kvm;

For 1), two alternative names are mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for
two reasons. First, per AMD's presentation at the last KVM Forum they
implement vIOMMU in hardware and thus need to support user-managed
domains. An iommu uAPI notation might make more sense moving
forward. Second, it makes later uAPI naming easier as 'IOASID' can
always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID. :)

Another naming open is about IOASID (the software handle for an ioas)
and the associated hardware ID (PASID or substream ID). Jason thought
PASID is defined more from the SVA angle while ARM's convention sounds
clearer from the device p.o.v. Following this direction, SID/SSID will be
used to replace RID/PASID in this RFC (possibly also implying that
the kernel IOASID allocator should be renamed to an SSID allocator).
I don't have a better alternative. If no one objects, I'll change to this
new naming in the next version.

For 2), Jason prefers not to block it if there is no kernel design reason.
If one device is allowed to bind to multiple IOASID fd's, the main problem
is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
and giova_ioasid created in fd2 and then nesting them together (and
whether any cross-fd notification is required when handling invalidation,
etc.). We thought that this just adds some complexity while we are not
sure about the value of supporting it (when one fd can already afford all
discussed usages). Therefore this RFC proposes that a device be bound
to at most one IOASID fd. Does this rationale make sense?

At the other end there was also a thought about whether we should have
a single I/O address space per IOASID fd. It was discussed in the previous
thread that #fd's are insufficient to afford a theoretical 1M address
spaces per device. But let's revisit this and draw a clear conclusion
on whether this option is viable.

For 3), Jason/Jean both think it's cleaner to carry device info in the
uAPI. Actually this was one option we developed in earlier internal
versions of this RFC. Later on we changed it to the current way based
on a misinterpretation of a previous discussion. Thinking about it more,
we will adopt this suggestion in the next version, for both efficiency
(I/O page fault is already a long path) and security reasons (some faults
are unrecoverable, thus the faulting device must be identified/isolated).

This implies that VFIO_BOUND_IOASID will be extended to allow the user
to specify a device label. This label will be recorded in /dev/iommu to
serve per-device invalidation requests from the user and to report
per-device fault data to the user. In addition, the vPASID (if provided
by the user) will also be recorded in /dev/iommu so that vPASID<->pPASID
conversion is conducted properly, e.g. an invalidation request from the
user carries a vPASID which must be converted into a pPASID before calling
the iommu driver, and vice versa for raw fault data which carries a pPASID
while the user expects a vPASID.
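
Roughly it would look like below (field names are tentative, only to show
what gets recorded where):

	/* tentative extension of the bind ioctl to carry a device label */
	struct vfio_bind_ioasid_fd {
		__u32	argsz;
		__u32	flags;
		__s32	ioasid_fd;
		__u32	dev_label;	/* echoed back in fault data and used to
					 * scope per-device invalidation */
	};
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &bind);

	/* later, attaching to an IOASID with a guest-visible PASID */
	attach = (struct vfio_attach_ioasid) {
		.ioasid	= gva_ioasid,
		.vpasid	= vpasid,	/* recorded for vPASID<->pPASID conversion */
	};
	ioctl(device_fd, VFIO_ATTACH_IOASID, &attach);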

For 4), There are two options for specifying the IOASID attributes:

    In this RFC, an IOASID has no attribute before it's attached to any
    device. After device attach, user queries capability/format info
    about the IOMMU which the device belongs to, and then call
    different ioctl commands to set the attributes for an IOASID (e.g.
    map/unmap, bind/unbind user pgtable, nesting, etc.). This follows
    how the underlying iommu-layer API is designed: a domain reports
    capability/format info and serves iommu ops only after it's attached 
    to a device.

    Jason suggests having user to specify all attributes about how an
    IOASID is expected to work when creating this IOASID. This requires
    /dev/iommu to provide capability/format info once a device is bound
    to ioasid fd (before creating any IOASID). In concept this should work, 
    since given a device we can always find its IOMMU. The only gap is
    the aforementioned one: the current iommu API is designed per-domain
    instead of per-device.

It seems that to close this design open we have to touch the kAPI design,
and Joerg's input is highly appreciated here.

For 5), I'd expect Alex to chime in. Per my understanding, it looks like the
original purpose of this protocol is not about the I/O address space. It's
for KVM to know whether any device is assigned to this VM and then
do something special (e.g. posted interrupts, EPT cache attributes, etc.).
Because KVM deduces some policy based on the fact of an assigned device,
it needs to hold a reference to the related vfio group. This part is
irrelevant to this RFC.

But ARM's VMID usage is related to the I/O address space and thus needs
some consideration. Another strange thing is about PPC. It looks like it
also leverages this protocol to do the iommu group attach:
kvm_spapr_tce_attach_iommu_group. I don't know why it's done through KVM
instead of the VFIO uAPI in the first place.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  5:10   ` Lu Baolu
@ 2021-06-01  7:15     ` Shenming Lu
  2021-06-01 12:30       ` Lu Baolu
  0 siblings, 1 reply; 258+ messages in thread
From: Shenming Lu @ 2021-06-01  7:15 UTC (permalink / raw)
  To: Lu Baolu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/1 13:10, Lu Baolu wrote:
> Hi Shenming,
> 
> On 6/1/21 12:31 PM, Shenming Lu wrote:
>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>> etc.) are expected to use this interface instead of creating their own logic to
>>> isolate untrusted device DMAs initiated by userspace.
>>>
>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>> made on this uAPI.
>>>
>>> It's based on a lengthy discussion starting from here:
>>>     https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>
>>> It ends up to be a long writing due to many things to be summarized and
>>> non-trivial effort required to connect them into a complete proposal.
>>> Hope it provides a clean base to converge.
>>>
>>
>> [..]
>>
>>>
>>> /*
>>>    * Page fault report and response
>>>    *
>>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>    * the user and an ioctl to complete the fault.
>>>    *
>>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>    */
>>
>> Hi,
>>
>> It seems that the ioasid has different usage in different situation, it could
>> be directly used in the physical routing, or just a virtual handle that indicates
>> a page table or a vPASID table (such as the GPA address space, in the simple
>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>> Substream ID), right?
>>
>> And Baolu suggested that since one device might consume multiple page tables,
>> it's more reasonable to have one fault handler per page table. By this, do we
>> have to maintain such an ioasid info list in the IOMMU layer?
> 
> As discussed earlier, the I/O page fault and cache invalidation paths
> will have "device labels" so that the information could be easily
> translated and routed.
> 
> So it's likely the per-device fault handler registering API in iommu
> core can be kept, but /dev/ioasid will be grown with a layer to
> translate and propagate I/O page fault information to the right
> consumers.

Yeah, having a general preprocessing of the faults in IOASID seems to be
a doable direction. But since there may be more than one consumer at the
same time, who is responsible for registering the per-device fault handler?

Thanks,
Shenming

> 
> If things evolve in this way, probably the SVA I/O page fault also needs
> to be ported to /dev/ioasid.
> 
>>
>> Then if we add host IOPF support (for the GPA address space) in the future
>> (I have sent a series for this but it aimed for VFIO, I will convert it for
>> IOASID later [1] :-)), how could we find the handler for the received fault
>> event which only contains a Stream ID... Do we also have to maintain a
>> dev(vPASID)->ioasid mapping in the IOMMU layer?
>>
>> [1] https://lore.kernel.org/patchwork/cover/1410223/
> 
> Best regards,
> baolu
> .

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 16:23 ` Jean-Philippe Brucker
  2021-05-28 20:16   ` Jason Gunthorpe
@ 2021-06-01  7:50   ` Tian, Kevin
  1 sibling, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-01  7:50 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Saturday, May 29, 2021 12:23 AM
> >
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
> 
> Is there an advantage to moving software nesting into the kernel?
> We could just have the guest do its usual combined map/unmap on the child
> fd
> 

There are at least two intended usages:

1) From previous discussion it looks like PPC's window-based scheme can
be better supported with software nesting: a shared IOVA address space
as the parent (shared by all devices), which is nested by multiple
per-device windows as the children;

2) Some mdev drivers (e.g. kvmgt) may want to do write-protection on
guest data structures (base address programmed into a mediated MMIO
register). The base address is an IOVA while the KVM page-tracking API
is based on GPA. Nesting allows finding the GPA according to the IOVA.
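As a rough sketch of usage 2), all helper names below are purely
illustrative, not an existing kvmgt or KVM interface:

	/* The guest programs an IOVA into a mediated MMIO register. With
	 * software nesting the kernel already holds the GIOVA->GPA
	 * mapping of the child IOASID, so it can resolve the GPA and
	 * then use the GPA-based KVM page-tracking API to write-protect
	 * the guest data structure. Hypothetical helpers only.
	 */
	gpa = ioasid_iova_to_gpa(giova_ioasid, iova);	// walk the child mapping
	kvmgt_write_protect_gpa(kvm, gpa);		// GPA-based page tracking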

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 17:35 ` Jason Gunthorpe
@ 2021-06-01  8:10   ` Tian, Kevin
  2021-06-01 17:42     ` Jason Gunthorpe
  2021-06-02  6:32   ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-01  8:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 1:36 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
> 
> Why? A SW emulation could do this synchronization during invalidation
> processing if invalidation contained an IOVA range.

In this proposal we differentiate between host-managed and user-
managed I/O page tables. If host-managed, the user is expected to use
the map/unmap cmds explicitly upon any change required on the page
table. If user-managed, the user first binds its page table to the
IOMMU and then uses the invalidation cmd to flush the iotlb when
necessary (e.g. typically not required when changing a PTE from
non-present to present).

We expect the user to use map+unmap and bind+invalidate respectively
instead of mixing them together. Following this policy, map+unmap
must be used in both levels for software nesting, so changes in either
level are captured in time to synchronize the shadow mapping.
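To make the two protocols concrete, here is a minimal sketch in the
same pseudo-code style as the proposal. IOASID_DMA_UNMAP and
IOASID_INVALIDATE_CACHE are assumed names for illustration, and the
struct layouts are likewise illustrative:

	/* Kernel-managed I/O page table: map/unmap protocol. Every
	 * change to the address space goes through explicit commands,
	 * so the kernel always observes it.
	 */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0x2000;	// GPA
		.vaddr	= 0x40001000;	// HVA
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
	/* ... later ... */
	ioctl(ioasid_fd, IOASID_DMA_UNMAP, &dma_map);

	/* User-managed I/O page table: bind/invalidate protocol. The
	 * user owns the page table; the kernel is only told when the
	 * iotlb must be flushed.
	 */
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
	/* ... the guest modifies a present PTE ... */
	inv_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva;
		.size	= 4KB;
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);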

> 
> I think this document would be stronger to include some "Rational"
> statements in key places
> 

Sure. I tried to provide rationale as much as possible but sometimes 
it's lost in a complex context like this. :)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 19:58 ` Jason Gunthorpe
@ 2021-06-01  8:38   ` Tian, Kevin
  2021-06-01 17:56     ` Jason Gunthorpe
  2021-06-02  6:48   ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-01  8:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 3:59 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > 	ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, want to confirm here. Per earlier discussion we were left with
the impression that you want VFIO to be a pure device driver, thus
container/group are used only for legacy applications. From this
comment are you suggesting that VFIO can still keep the container/
group concepts and the user just deprecates the use of the vfio iommu
uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has
a simple policy that an IOASID will reject a cmd if a partially-attached
group exists)?

> 
> 
> > Three types of IOASIDs are considered:
> >
> > 	gpa_ioasid[1...N]: 	for GPA address space
> > 	giova_ioasid[1...N]:	for guest IOVA address space
> > 	gva_ioasid[1...N]:	for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > 	/* Bind device to IOASID fd */
> > 	device_fd = open("/dev/vfio/devices/dev1", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* Attach device to IOASID */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0;		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
> 
> eg
> 
>  	device2_fd = open("/dev/vfio/devices/dev1", mode);
>  	ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>  	ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)

> 
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu need to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > 	/* pre-register the virtual address range for accounting */
> > 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> > 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > 	/* Attach dev1 and dev2 to gpa_ioasid */
> > 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup GPA mapping */
> > 	dma_map = {
> > 		.ioasid	= gpa_ioasid;
> > 		.iova	= 0; 		// GPA
> > 		.vaddr	= 0x40000000;	// HVA
> > 		.size	= 1GB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > 	/* After boot, guest enables an GIOVA space for dev2 */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> > 	/* First detach dev2 from previous address space */
> > 	at_data = { .ioasid = gpa_ioasid};
> > 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> > 	/* Then attach dev2 to the new address space */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a shadow DMA mapping according to vIOMMU
> > 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> > 	  */
> 
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' means the merged mapping: GIOVA(0x2000) -> HVA (0x40001000)

> 
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000; 	// GIOVA
> > 		.vaddr	= 0x40001000;	// HVA
> 
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

yes

> 
> 
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> > 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> > 	  * to form a shadow mapping.
> > 	  */
> > 	dma_map = {
> > 		.ioasid	= giova_ioasid;
> > 		.iova	= 0x2000;	// GIOVA
> > 		.vaddr	= 0x1000;	// GPA
> > 		.size	= 4KB;
> > 	};
> > 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient

yes.

> 
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> > 	/* After boots */
> > 	/* Make GIOVA space nested on GPA space */
> > 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev2 to the new address space (child)
> > 	  * Note dev2 is still attached to gpa_ioasid (parent)
> > 	  */
> > 	at_data = { .ioasid = giova_ioasid};
> > 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= giova_ioasid;
> > 		.addr	= giova_pgtable;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be calle IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
> 
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boots the guest further create a GVA address spaces (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid expose ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > 	/* After boots */
> > 	/* Make GVA space nested on GPA space */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space and specify vPASID */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane that the guest links an I/O page table to
different vPASIDs on dev1 and dev2. The IOMMU doesn't mandate
that when multiple devices share an I/O page table they must use
the same PASID#.
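For illustration, in the proposal's pseudo-code style (gpasid1 and
gpasid2 are just example values chosen by the guest):

	/* dev1 and dev2 share the same I/O page table (gva_ioasid),
	 * but the guest assigned a different vPASID to each device.
	 */
	at_data = {
		.ioasid		= gva_ioasid;
		.flag		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid1;	// vPASID used by dev1
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	at_data = {
		.ioasid		= gva_ioasid;
		.flag		= IOASID_ATTACH_USER_PASID;
		.user_pasid	= gpasid2;	// a different vPASID for dev2
	};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);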

> 
> > 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > 	  * translation structure through KVM
> > 	  */
> > 	pa_data = {
> > 		.ioasid_fd	= ioasid_fd;
> > 		.ioasid		= gva_ioasid;
> > 		.guest_pasid	= gpasid1;
> > 	};
> > 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Make sense
> 
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

> 
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> >
> > -   Host IOMMU driver receives a page request with raw fault_data {rid,
> >     pasid, addr};
> >
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> >     information registered by IOASID fault handler;
> >
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> which
> >     is saved in ioasid_data->fault_data (used for response);
> >
> > -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
> >     to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and 
bind_pasid_table. I summarized it as open#3 in another mail.

thus the following is skipped...

Thanks
Kevin

> 
> > -   Upon received event, Qemu needs to find the virtual routing information
> >     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> >     multiple, pick a random one. This should be fine since the purpose is to
> >     fix the I/O page table on the guest;
> 
> The device label should fix this
> 
> > -   Qemu finds the pending fault event, converts virtual completion data
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> >     complete the pending fault;
> >
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> >     ioasid_data->fault_data, and then calls iommu api to complete it with
> >     {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > PASID table is put in the GPA space on some platform, thus must be
> updated
> > by the guest. It is treated as another user page table to be bound with the
> > IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which,
> once
> > enabled, requires the guest to invalidate PASID cache for any change on the
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> > 	/* After boots */
> > 	/* Make vPASID space nested on GPA space */
> > 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to pasidtbl_ioasid */
> > 	at_data = { .ioasid = pasidtbl_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind PASID table */
> > 	bind_data = {
> > 		.ioasid	= pasidtbl_ioasid;
> > 		.addr	= gpa_pasid_table;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > 	/* vIOMMU detects a new GVA I/O space created */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> >
> > 	/* Attach dev1 to the new address space, with gpasid1 */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > 	  * used, the kernel will not update the PASID table. Instead, just
> > 	  * track the bound I/O page table for handling invalidation and
> > 	  * I/O page faults.
> > 	  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benifit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  6:16                   ` Tian, Kevin
@ 2021-06-01  8:47                     ` Jason Wang
  2021-06-01 17:31                       ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-01  8:47 UTC (permalink / raw)
  To: Tian, Kevin, Lu Baolu, Liu Yi L
  Cc: kvm, Jonathan Corbet, iommu, LKML,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Gunthorpe, David Woodhouse


On 2021/6/1 2:16 PM, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 2:07 PM
>>
>> On 2021/6/1 1:42 PM, Tian, Kevin wrote:
>>>> From: Jason Wang
>>>> Sent: Tuesday, June 1, 2021 1:30 PM
>>>>
>>>> On 2021/6/1 1:23 PM, Lu Baolu wrote:
>>>>> Hi Jason W,
>>>>>
>>>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>>>> happens
>>>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>>>> 3) If not, how GET_INFO work?
>>>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>>>> yet. But using ID may give us opportunity to customize the
>> management
>>>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>>> I'm not sure I get this, for nesting cases you can still make the
>>>>>> child an fd.
>>>>>>
>>>>>> And a question still, under what case we need to create multiple
>>>>>> ioasids on a single ioasid fd?
>>>>> One possible situation where multiple IOASIDs per FD could be used is
>>>>> that devices with different underlying IOMMU capabilities are sharing a
>>>>> single FD. In this case, only devices with consistent underlying IOMMU
>>>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>>>> be applied.
>>>>>
>>>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>>>> IOASID FDs" for such case.
>>>> Right, that's exactly my question. The latter seems much more easier to
>>>> be understood and implemented.
>>>>
>>> A simple reason discussed in previous thread - there could be 1M's
>>> I/O address spaces per device while #FD's are precious resource.
>>
>> Is the concern for ulimit or performance? Note that we had
>>
>> #define NR_OPEN_MAX ~0U
>>
>> And with the fd semantic, you can do a lot of other stuffs: close on
>> exec, passing via SCM_RIGHTS.
> yes, fd has its merits.
>
>> For the case of 1M, I would like to know what's the use case for a
>> single process to handle 1M+ address spaces?
> This single process is Qemu with an assigned device. Within the guest
> there could be many guest processes. Though in reality I didn't see
> such 1M processes on a single device, better not restrict it in uAPI?


Sorry, I don't get it here.

We can open up to ~0U file descriptors, I don't see why we need to 
restrict it in uAPI.

Thanks


>
>>
>>> So this RFC treats fd as a container of address spaces which is each
>>> tagged by an IOASID.
>>
>> If the container and address space is 1:1 then the container seems useless.
>>
> yes, 1:1 then container is useless. But here it's assumed 1:M then
> even a single fd is sufficient for all intended usages.
>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36 ` Jason Gunthorpe
  2021-05-31 11:31   ` Liu Yi L
@ 2021-06-01 11:09   ` Lu Baolu
  2021-06-01 17:26     ` Jason Gunthorpe
  2021-06-03  5:54     ` David Gibson
  2021-06-02  7:22   ` David Gibson
  2021-06-03  6:39   ` Tian, Kevin
  3 siblings, 2 replies; 258+ messages in thread
From: Lu Baolu @ 2021-06-01 11:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: baolu.lu, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

Hi Jason,

On 2021/5/29 7:36, Jason Gunthorpe wrote:
>> /*
>>    * Bind an user-managed I/O page table with the IOMMU
>>    *
>>    * Because user page table is untrusted, IOASID nesting must be enabled
>>    * for this ioasid so the kernel can enforce its DMA isolation policy
>>    * through the parent ioasid.
>>    *
>>    * Pgtable binding protocol is different from DMA mapping. The latter
>>    * has the I/O page table constructed by the kernel and updated
>>    * according to user MAP/UNMAP commands. With pgtable binding the
>>    * whole page table is created and updated by userspace, thus different
>>    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>>    *
>>    * Because the page table is directly walked by the IOMMU, the user
>>    * must  use a format compatible to the underlying hardware. It can
>>    * check the format information through IOASID_GET_INFO.
>>    *
>>    * The page table is bound to the IOMMU according to the routing
>>    * information of each attached device under the specified IOASID. The
>>    * routing information (RID and optional PASID) is registered when a
>>    * device is attached to this IOASID through VFIO uAPI.
>>    *
>>    * Input parameters:
>>    *	- child_ioasid;
>>    *	- address of the user page table;
>>    *	- formats (vendor, address_width, etc.);
>>    *
>>    * Return: 0 on success, -errno on failure.
>>    */
>> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
>> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?
> 

Thinking of the required page table format, perhaps we should shed more
light on the page table of an IOASID. So far, an IOASID might represent
one of the following page tables (might be more):

  1) an IOMMU format page table (a.k.a. iommu_domain)
  2) a user application CPU page table (SVA for example)
  3) a KVM EPT (future option)
  4) a VM guest managed page table (nesting mode)

This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond? If so, it seems that we need some in-kernel helpers and
uAPIs to support pre-installing a page table to an IOASID. From this point
of view an IOASID is actually not just a variant of iommu_domain, but an
I/O page table representation in a broader sense.
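Just to make the classification above concrete, a hypothetical sketch
(not an existing uAPI; the names are illustrative only):

	/* Possible sources of the I/O page table behind an IOASID,
	 * mirroring the list above.
	 */
	enum ioasid_pgtable_type {
		IOASID_PGTABLE_IOMMU,		/* 1) kernel-managed iommu_domain */
		IOASID_PGTABLE_CPU_SVA,		/* 2) user CPU page table (SVA) */
		IOASID_PGTABLE_KVM_EPT,		/* 3) KVM EPT (future option) */
		IOASID_PGTABLE_GUEST_NESTED,	/* 4) guest-managed table (nesting) */
	};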

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 18:12   ` Jason Gunthorpe
@ 2021-06-01 12:04     ` Parav Pandit
  2021-06-01 17:36       ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Parav Pandit @ 2021-06-01 12:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy



> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, May 31, 2021 11:43 PM
> 
> On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
> 
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example few years back such system call mpin() thought was proposed
> in [1].
> 
> Reference counting of the overall pins are required
> 
> So when a pinned pages is incorporated into an IOASID page table in a later
> IOCTL it means it cannot be unpinned while the IOASID page table is using it.
OK, but can't it use the same refcount that the mmu uses?

> 
> This is some trick to organize the pinning into groups and then refcount each
> group, thus avoiding needing per-page refcounts.
Pinned page refcount is already maintained by the mmu without ioasid, isn't it?

> 
> The data structure would be an interval tree of pins in general
> 
> The ioasid itself would have an interval tree of its own mappings, each entry
> in this tree would reference count against an element in the above tree
> 
> Then the ioasid's interval tree would be mapped into a page table tree in HW
> format.
Does it mean that in the simple use case [1], a second-level page table copy is maintained on the IOMMU side via the map interface?
I hope not. It should use the same one that the mmu uses, right?

[1] one SIOV/ADI device assigned with one PASID and mapped in guest VM

> 
> The redundant storages are needed to keep track of the refencing and the
> CPU page table values for later unpinning.
> 
> Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  7:15     ` Shenming Lu
@ 2021-06-01 12:30       ` Lu Baolu
  2021-06-01 13:10         ` Shenming Lu
  2021-06-01 17:33         ` Jason Gunthorpe
  0 siblings, 2 replies; 258+ messages in thread
From: Lu Baolu @ 2021-06-01 12:30 UTC (permalink / raw)
  To: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: baolu.lu, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu,
	Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/1 15:15, Shenming Lu wrote:
> On 2021/6/1 13:10, Lu Baolu wrote:
>> Hi Shenming,
>>
>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>> isolate untrusted device DMAs initiated by userspace.
>>>>
>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>> made on this uAPI.
>>>>
>>>> It's based on a lengthy discussion starting from here:
>>>>      https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>>
>>>> It ends up to be a long writing due to many things to be summarized and
>>>> non-trivial effort required to connect them into a complete proposal.
>>>> Hope it provides a clean base to converge.
>>>>
>>> [..]
>>>
>>>> /*
>>>>     * Page fault report and response
>>>>     *
>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>     * the user and an ioctl to complete the fault.
>>>>     *
>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>     */
>>> Hi,
>>>
>>> It seems that the ioasid has different usage in different situation, it could
>>> be directly used in the physical routing, or just a virtual handle that indicates
>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>> Substream ID), right?
>>>
>>> And Baolu suggested that since one device might consume multiple page tables,
>>> it's more reasonable to have one fault handler per page table. By this, do we
>>> have to maintain such an ioasid info list in the IOMMU layer?
>> As discussed earlier, the I/O page fault and cache invalidation paths
>> will have "device labels" so that the information could be easily
>> translated and routed.
>>
>> So it's likely the per-device fault handler registering API in iommu
>> core can be kept, but /dev/ioasid will be grown with a layer to
>> translate and propagate I/O page fault information to the right
>> consumers.
> Yeah, having a general preprocessing of the faults in IOASID seems to be
> a doable direction. But since there may be more than one consumer at the
> same time, who is responsible for registering the per-device fault handler?

The drivers register per-page-table fault handlers with /dev/ioasid,
which will then register itself with the iommu core to listen for and
route the per-device I/O page faults. This is just a top-level thought.
I haven't gone through the details yet. We need to wait and see what
/dev/ioasid finally looks like.
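A very rough sketch of that layering (the ioasid_* names below are
hypothetical; only iommu_register_device_fault_handler() is an existing
iommu core API, and even its use here is just an assumption):

	/* /dev/ioasid registers one handler per bound device with the
	 * iommu core; on a fault it looks up the target IOASID from the
	 * (device label, pasid) routing info and queues the event for
	 * whoever bound that page table.
	 */
	static int ioasid_dev_fault_handler(struct iommu_fault *fault, void *data)
	{
		struct ioasid_dev_label *label = data;
		struct ioasid_data *ioasid;

		ioasid = ioasid_lookup_by_routing(label, fault->prm.pasid);
		if (!ioasid)
			return -ENODEV;
		return ioasid_queue_fault(ioasid, label, fault);
	}

	/* called when a device is bound/attached through the VFIO uAPI */
	iommu_register_device_fault_handler(dev, ioasid_dev_fault_handler, label);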

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 12:30       ` Lu Baolu
@ 2021-06-01 13:10         ` Shenming Lu
  2021-06-01 17:33         ` Jason Gunthorpe
  1 sibling, 0 replies; 258+ messages in thread
From: Shenming Lu @ 2021-06-01 13:10 UTC (permalink / raw)
  To: Lu Baolu, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Jean-Philippe Brucker
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/1 20:30, Lu Baolu wrote:
> On 2021/6/1 15:15, Shenming Lu wrote:
>> On 2021/6/1 13:10, Lu Baolu wrote:
>>> Hi Shenming,
>>>
>>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>>> isolate untrusted device DMAs initiated by userspace.
>>>>>
>>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>>> made on this uAPI.
>>>>>
>>>>> It's based on a lengthy discussion starting from here:
>>>>>      https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
>>>>>
>>>>> It ends up to be a long writing due to many things to be summarized and
>>>>> non-trivial effort required to connect them into a complete proposal.
>>>>> Hope it provides a clean base to converge.
>>>>>
>>>> [..]
>>>>
>>>>> /*
>>>>>     * Page fault report and response
>>>>>     *
>>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>>     * the user and an ioctl to complete the fault.
>>>>>     *
>>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>>     */
>>>> Hi,
>>>>
>>>> It seems that the ioasid has different usage in different situation, it could
>>>> be directly used in the physical routing, or just a virtual handle that indicates
>>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>>> Substream ID), right?
>>>>
>>>> And Baolu suggested that since one device might consume multiple page tables,
>>>> it's more reasonable to have one fault handler per page table. By this, do we
>>>> have to maintain such an ioasid info list in the IOMMU layer?
>>> As discussed earlier, the I/O page fault and cache invalidation paths
>>> will have "device labels" so that the information could be easily
>>> translated and routed.
>>>
>>> So it's likely the per-device fault handler registering API in iommu
>>> core can be kept, but /dev/ioasid will be grown with a layer to
>>> translate and propagate I/O page fault information to the right
>>> consumers.
>> Yeah, having a general preprocessing of the faults in IOASID seems to be
>> a doable direction. But since there may be more than one consumer at the
>> same time, who is responsible for registering the per-device fault handler?
> 
> The drivers register per-page-table fault handlers with /dev/ioasid,
> which will then register itself with the iommu core to listen for and
> route the per-device I/O page faults. This is just a top-level thought.
> I haven't gone through the details yet. We need to wait and see what
> /dev/ioasid finally looks like.

OK. And it needs to be confirmed by Jean since we might migrate the code from
io-pgfault.c to IOASID... Anyway, let's finalize /dev/ioasid first.  Thanks,

Shenming

> 
> Best regards,
> baolu
> .

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  3:08       ` Lu Baolu
@ 2021-06-01 17:24         ` Jason Gunthorpe
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:24 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Liu Yi L, Jean-Philippe Brucker, Tian, Kevin, Jiang, Dave, Raj,
	Ashok, kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy, David Gibson

On Tue, Jun 01, 2021 at 11:08:53AM +0800, Lu Baolu wrote:
> On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
> > > > device bind should fail if the device somehow isn't compatible with
> > > > the scheme the user is tring to use.
> > > yeah, I guess you mean to fail the device attach when the IOASID is a
> > > nesting IOASID but the device is behind an iommu without nesting support.
> > > right?
> > Right..
> 
> Just want to confirm...
> 
> Does this mean that we only support hardware nesting and don't want to
> have soft nesting (shadowed page table in kernel) in IOASID?

No, the uAPI presents a contract; if the kernel can fulfill the
contract then it should be supported.

If you want SW nesting then the kernel has to have the SW support for
it, or fail.

At least for the purposes of the document I wouldn't delve too much deeper
into that question.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 11:09   ` Lu Baolu
@ 2021-06-01 17:26     ` Jason Gunthorpe
  2021-06-02  4:01       ` Lu Baolu
  2021-06-03  5:54     ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:26 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:

> This version only covers 1) and 4). Do you think we need to support 2),
> 3) and beyond? 

Yes, absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all
HW-specific details on those formats.

> If so, it seems that we need some in-kernel helpers and uAPIs to
> support pre-installing a page table to an IOASID. 

Not sure what this means..

> From this point of view an IOASID is actually not just a variant of
> iommu_domain, but an I/O page table representation in a broader
> sense.

Yes, and things need to evolve in a staged way. The ioctl API should
have room for this growth, but you need to start out with something
constrained enough to actually implement, then figure out how to grow
from there.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  6:07                 ` Jason Wang
  2021-06-01  6:16                   ` Tian, Kevin
@ 2021-06-01 17:29                   ` Jason Gunthorpe
  2021-06-02  8:58                     ` Jason Wang
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L,
	Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, David Woodhouse

On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:

> For the case of 1M, I would like to know what's the use case for a single
> process to handle 1M+ address spaces?

For some scenarios every guest PASID will require an IOASID ID #, so
there is a large enough demand that FDs alone are not a good fit.

Further, there are global container-wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at startup.

> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
> 
> If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (7 preceding siblings ...)
  2021-06-01  4:31 ` Shenming Lu
@ 2021-06-01 17:30 ` Parav Pandit
  2021-06-03 20:58   ` Jacob Pan
  2021-06-02  6:15 ` David Gibson
  2021-06-02  8:56 ` Enrico Weigelt, metux IT consult
  10 siblings, 1 reply; 258+ messages in thread
From: Parav Pandit @ 2021-06-01 17:30 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, May 27, 2021 1:28 PM

> 5.6. I/O page fault
> +++++++++++++++
> 
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid,
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates an user fault_data (ioasid, addr), links it
>     to the shared ring buffer and triggers eventfd to userspace;
> 
> -   Upon received event, Qemu needs to find the virtual routing information
>     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
>     multiple, pick a random one. This should be fine since the purpose is to
>     fix the I/O page table on the guest;
> 
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>     carrying the virtual fault data (v_rid, v_pasid, addr);
> 
Why does it have to be through the vIOMMU?
For a VFIO PCI device, have you considered reusing the same PRI interface to inject the page fault into the guest?
This eliminates any new v_rid.
It will also route the page fault request and response through the right vfio device.

> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>     then sends a page response with virtual completion data (v_rid, v_pasid,
>     response_code) to vIOMMU;
> 
What about fixing up the fault for the mmu page table as well in the guest?
Or did you mean both when you said "updates the I/O page table" above?

It is unclear to me whether there is a single nested page table maintained or two (one for cr3 references and the other for the iommu).
Can you please clarify?

> -   Qemu finds the pending fault event, converts virtual completion data
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
>     complete the pending fault;
> 
For a VFIO PCI device, once a virtual PRI request/response interface is done, it can be a generic interface among multiple vIOMMUs.

> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};
>

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  8:47                     ` Jason Wang
@ 2021-06-01 17:31                       ` Jason Gunthorpe
  2021-06-02  8:54                         ` Jason Wang
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L, kvm, Jonathan Corbet, iommu,
	LKML, Alex Williamson (alex.williamson@redhat.com),
	David Woodhouse

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
 
> We can open up to ~0U file descriptors, I don't see why we need to restrict
> it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore, and IIRC there are more complications.

A huge number of FDs for typical usages should be avoided.
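For reference, the classic select(2) limitation: descriptors at or
above FD_SETSIZE (typically 1024) cannot be put into an fd_set at all,
e.g.:

	#include <sys/select.h>

	/* returns -1 if the fd is too large for select() to watch;
	 * poll()/epoll() would be needed for such descriptors
	 */
	int monitor_fd(int fd, fd_set *rfds)
	{
		if (fd >= FD_SETSIZE)
			return -1;
		FD_SET(fd, rfds);
		return 0;
	}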

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 12:30       ` Lu Baolu
  2021-06-01 13:10         ` Shenming Lu
@ 2021-06-01 17:33         ` Jason Gunthorpe
  2021-06-02  4:50           ` Shenming Lu
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:33 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Shenming Lu, Tian, Kevin, LKML, Joerg Roedel, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy, Zenghui Yu,
	wanghaibin.wang

On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:

> The drivers register per-page-table fault handlers with /dev/ioasid,
> which will then register itself with the iommu core to listen for and
> route the per-device I/O page faults. 

I'm still confused why drivers need fault handlers at all?

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 12:04     ` Parav Pandit
@ 2021-06-01 17:36       ` Jason Gunthorpe
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:36 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 12:04:00PM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, May 31, 2021 11:43 PM
> > 
> > On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
> > 
> > > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid?
> > > For example few years back such system call mpin() thought was proposed
> > in [1].
> > 
> > Reference counting of the overall pins are required
> > 
> > So when a pinned pages is incorporated into an IOASID page table in a later
> > IOCTL it means it cannot be unpinned while the IOASID page table is using it.
> OK, but can't it use the same refcount that the mmu uses?

Manipulating that refcount is part of the overhead that we are trying
to avoid here, plus ensuring that the pinned pages accounting
doesn't get out of sync with the actual count of pinned pages!

> > Then the ioasid's interval tree would be mapped into a page table tree in HW
> > format.

> Does it mean that in the simple use case [1], a second-level page
> table copy is maintained on the IOMMU side via the map interface?
> I hope not. It should use the same one that the mmu uses, right?

Not a full page-by-page copy, but some interval reference.
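A hedged data-structure sketch of the above (illustrative names, not
actual kernel structures):

	/* Pins are grouped into user-VA intervals; each IOASID mapping
	 * entry holds a reference on the pin group it uses, so pages
	 * cannot be unpinned while any IOASID mapping still needs them.
	 * No per-page refcount is required.
	 */
	struct pin_interval {
		struct interval_tree_node node;	/* [start, last] of user VA */
		refcount_t users;		/* # of IOASID mappings referencing it */
		struct page **pages;		/* kept for later unpinning */
	};

	struct ioasid_mapping {
		struct interval_tree_node node;	/* [iova, iova_last] */
		struct pin_interval *pin;	/* refcounted pin group */
	};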

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  8:10   ` Tian, Kevin
@ 2021-06-01 17:42     ` Jason Gunthorpe
  2021-06-02  1:33       ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 29, 2021 1:36 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > 
> > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > software nesting. With hardware support the child and parent I/O page
> > > tables are walked consecutively by the IOMMU to form a nested translation.
> > > When it's implemented in software, the ioasid driver is responsible for
> > > merging the two-level mappings into a single-level shadow I/O page table.
> > > Software nesting requires both child/parent page tables operated through
> > > the dma mapping protocol, so any change in either level can be captured
> > > by the kernel to update the corresponding shadow mapping.
> > 
> > Why? A SW emulation could do this synchronization during invalidation
> > processing if invalidation contained an IOVA range.
> 
> In this proposal we differentiate between host-managed and user-
> managed I/O page tables. If host-managed, the user is expected to use
> the map/unmap cmds explicitly upon any change required on the page
> table. If user-managed, the user first binds its page table to the
> IOMMU and then uses the invalidation cmd to flush the iotlb when
> necessary (e.g. typically not required when changing a PTE from
> non-present to present).
> 
> We expect the user to use map+unmap and bind+invalidate respectively
> instead of mixing them together. Following this policy, map+unmap
> must be used in both levels for software nesting, so changes in either
> level are captured in time to synchronize the shadow mapping.

map+unmap or bind+invalidate is a policy of the IOASID itself set when
it is created. If you put two different types in a tree then each IOASID
must continue to use its own operation mode.

I don't see a reason to force all IOASIDs in a tree to be consistent??

A software-emulated two-level page table where the leaf level is a
bound page table in guest memory should continue to use
bind/invalidate to maintain the guest page table IOASID, even though it
is a SW construct.

The GPA level should use map/unmap because it is a kernel-owned page
table.

Though how to efficiently mix map/unmap on the GPA when there are SW
nested levels below it looks to be quite challenging.
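A small sketch of the per-IOASID mode idea (hypothetical structure,
for illustration only):

	/* The operation mode is a property of each IOASID, fixed at
	 * creation time; IOASIDs in the same tree may use different
	 * modes, e.g. map/unmap at the GPA level and bind/invalidate
	 * for a guest-owned leaf level.
	 */
	enum ioasid_mode {
		IOASID_MODE_MAP_UNMAP,		/* kernel-owned I/O page table */
		IOASID_MODE_BIND_INVALIDATE,	/* user/guest-owned I/O page table */
	};

	struct ioasid_data {
		struct ioasid_data *parent;	/* NULL for the root of the tree */
		enum ioasid_mode mode;
		/* ... page table, attached devices, fault data, etc. ... */
	};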

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  8:38   ` Tian, Kevin
@ 2021-06-01 17:56     ` Jason Gunthorpe
  2021-06-02  2:00       ` Tian, Kevin
  2021-06-02  6:57       ` David Gibson
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 17:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 29, 2021 3:59 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > >
> > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > >
> > > 	ioasid_fd = open("/dev/ioasid", mode);
> > >
> > > For simplicity below examples are all made for the virtualization story.
> > > They are representative and could be easily adapted to a non-virtualization
> > > scenario.
> > 
> > For others, I don't think this is *strictly* necessary, we can
> > probably still get to the device_fd using the group_fd and fit in
> > /dev/ioasid. It does make the rest of this more readable though.
> 
> Jason, want to confirm here. Per earlier discussion we were left with
> the impression that you want VFIO to be a pure device driver, thus
> container/group are used only for legacy applications.

Let me call this a "nice wish".

If you get to a point where you hard need this, then identify the hard
requirement and let's do it, but I wouldn't bloat this already large
project unnecessarily.

Similarly I wouldn't depend on the group fd existing in this design
so it could be changed later.

> From this comment are you suggesting that VFIO can still keep the
> container/group concepts and the user just deprecates the use of the
> vfio iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has
> a simple policy that an IOASID will reject a cmd if a partially-attached
> group exists)?

I would say no on the container. /dev/ioasid == the container; having
two competing objects at once in a single process is just a mess.

Whether the group fd can be kept requires charting a path through the
ioctls where the container is not used and /dev/ioasid is sub'd in
using the same device-FD-specific IOCTLs you show here.

I didn't try to chart this out carefully.

Also, ultimately, something needs to be done about compatibility with
the vfio container fd. It looks clear enough to me that the VFIO
container FD is just a single IOASID using a special ioctl interface,
so it would be quite reasonable to harmonize these somehow.

But that is too complicated and far out for me, at least, to guess on at
this point..

> > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > there any scenario where we want different vpasid's for the same
> > IOASID? I guess it is OK like this. Hum.
> 
> Yes, it's completely sane that the guest links an I/O page table to
> different vPASIDs on dev1 and dev2. The IOMMU doesn't mandate
> that when multiple devices share an I/O page table they must use
> the same PASID#. 

Ok..

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  7:01   ` Tian, Kevin
@ 2021-06-01 20:28     ` Jason Gunthorpe
  2021-06-02  1:25       ` Tian, Kevin
  2021-06-02  8:52       ` Jason Wang
  2021-06-01 22:22     ` Alex Williamson
  1 sibling, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-01 20:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 29, 2021 4:03 AM
> > 
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > /dev/ioasid provides an unified interface for managing I/O page tables for
> > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > vDPA,
> > > etc.) are expected to use this interface instead of creating their own logic to
> > > isolate untrusted device DMAs initiated by userspace.
> > 
> > It is very long, but I think this has turned out quite well. It
> > certainly matches the basic sketch I had in my head when we were
> > talking about how to create vDPA devices a few years ago.
> > 
> > When you get down to the operations they all seem pretty common sense
> > and straightfoward. Create an IOASID. Connect to a device. Fill the
> > IOASID with pages somehow. Worry about PASID labeling.
> > 
> > It really is critical to get all the vendor IOMMU people to go over it
> > and see how their HW features map into this.
> > 
> 
> Agree. btw I feel it might be good to have several design opens 
> centrally discussed after going through all the comments. Otherwise 
> they may be buried in different sub-threads and potentially with 
> insufficient care (especially for people who haven't completed the
> reading).
> 
> I summarized five opens here, about:
> 
> 1)  Finalizing the name to replace /dev/ioasid;
> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> 3)  Carry device information in invalidation/fault reporting uAPI;
> 4)  What should/could be specified when allocating an IOASID;
> 5)  The protocol between vfio group and kvm;
> 
> For 1), two alternative names are mentioned: /dev/iommu and 
> /dev/ioas. I don't have a strong preference and would like to hear 
> votes from all stakeholders. /dev/iommu is slightly better imho for 
> two reasons. First, per AMD's presentation in last KVM forum they 
> implement vIOMMU in hardware thus need to support user-managed 
> domains. An iommu uAPI notation might make more sense moving 
> forward. Second, it makes later uAPI naming easier as 'IOASID' can 
> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of 
> IOASID_ALLOC_IOASID. :)

I think two years ago I suggested /dev/iommu and it didn't go very far
at the time. We've also talked about this as /dev/sva for a while and
now /dev/ioasid.

I think /dev/iommu is fine, and call the things inside them IOAS
objects.

Then we don't have naming aliasing with kernel constructs.
 
> For 2), Jason prefers to not blocking it if no kernel design reason. If 
> one device is allowed to bind multiple IOASID fd's, the main problem
> is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1 
> and giova_ioasid created in fd2 and then nesting them together (and

Huh? This can't happen

Creating an IOASID is an operation on the /dev/ioasid FD. We won't
provide APIs to create a tree of IOASID's outside a single FD container.

If a device can consume multiple IOASID's it doesn't care how many or
what /dev/ioasid FDs they come from.
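
To make the scoping concrete, here is a minimal sketch assuming the
ioctl names from this RFC; the exact argument structs and error
handling are glossed over:

	/* Sketch only; argument details elided. */
	int fd1 = open("/dev/ioasid", O_RDWR);
	int fd2 = open("/dev/ioasid", O_RDWR);

	int gpa_ioasid   = ioctl(fd1, IOASID_ALLOC);	/* lives in fd1 */
	int giova_ioasid = ioctl(fd2, IOASID_ALLOC);	/* lives in fd2 */

	/* Nesting is expressed against a single FD, so a parent/child
	 * link can never reference an IOASID from another FD:
	 */
	int child = ioctl(fd1, IOASID_CREATE_NESTING, gpa_ioasid);  /* ok */
	/* there is simply no call that could nest giova_ioasid (fd2)
	 * under gpa_ioasid (fd1) */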

> To the other end there was also thought whether we should make
> a single I/O address space per IOASID fd. This was discussed in previous
> thread that #fd's are insufficient to afford theoretical 1M's address
> spaces per device. But let's have another revisit and draw a clear
> conclusion whether this option is viable.

I had remarks on this; I think per-fd doesn't work
 
> This implies that VFIO_BOUND_IOASID will be extended to allow user
> specify a device label. This label will be recorded in /dev/iommu to
> serve per-device invalidation request from and report per-device 
> fault data to the user.

I wonder which is the best choice here: the user providing a 64-bit
cookie, or the kernel returning a small IDA? Both have merits
depending on what qemu needs.

> In addition, vPASID (if provided by user) will
> be also recorded in /dev/iommu so vPASID<->pPASID conversion 
> is conducted properly. e.g. invalidation request from user carries
> a vPASID which must be converted into pPASID before calling iommu
> driver. Vice versa for raw fault data which carries pPASID while the
> user expects a vPASID.

I don't think the PASID should be returned at all. It should return
the IOASID number in the FD and/or a u64 cookie associated with that
IOASID. Userspace should figure out what the IOASID & device
combination means.
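
For illustration only, a hypothetical shape of a fault record along
those lines; every field name below is invented here, not part of the
proposal:

	/* Hypothetical fault record. No pPASID is exposed; userspace sees
	 * the FD-local IOASID plus the device label (user cookie or small
	 * kernel-allocated id, per the open question above) and maps that
	 * back to a vPASID itself.
	 */
	struct ioasid_fault_data {
		__u32	ioasid;		/* FD-local IOASID number */
		__u32	flags;
		__u64	dev_label;	/* label registered at attach time */
		__u64	fault_addr;	/* faulting address in that IOAS */
	};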

> Seems to close this design open we have to touch the kAPI design. and 
> Joerg's input is highly appreciated here.

uAPI is forever, the kAPI is constantly changing. I always dislike
warping the uAPI based on the current kAPI situation.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01  7:01   ` Tian, Kevin
  2021-06-01 20:28     ` Jason Gunthorpe
@ 2021-06-01 22:22     ` Alex Williamson
  2021-06-02  2:20       ` Tian, Kevin
  2021-06-08  2:37       ` David Gibson
  1 sibling, 2 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-01 22:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok,
	Liu, Yi L, Wu, Hao, Jiang, Dave, Jacob Pan,
	Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy

On Tue, 1 Jun 2021 07:01:57 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> I summarized five opens here, about:
> 
> 1)  Finalizing the name to replace /dev/ioasid;
> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> 3)  Carry device information in invalidation/fault reporting uAPI;
> 4)  What should/could be specified when allocating an IOASID;
> 5)  The protocol between vfio group and kvm;
> 
...
> 
> For 5), I'd expect Alex to chime in. Per my understanding looks the
> original purpose of this protocol is not about I/O address space. It's
> for KVM to know whether any device is assigned to this VM and then
> do something special (e.g. posted interrupt, EPT cache attribute, etc.).

Right, the original use case was for KVM to determine whether it needs
to emulate invlpg, so it needs to be aware when an assigned device is
present and be able to test if DMA for that device is cache coherent.
The user, QEMU, creates a KVM "pseudo" device representing the vfio
group, providing the file descriptor of that group to show ownership.
The ugly symbol_get code is to avoid hard module dependencies, i.e. the
kvm module should not pull in or require the vfio module, but vfio will
be present if attempting to register this device.

With kvmgt, the interface also became a way to register the kvm pointer
with vfio for the translation mentioned elsewhere in this thread.

The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
page table so that it can handle iotlb programming from pre-registered
memory without trapping out to userspace.
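
For reference, the existing flow looks roughly like this from QEMU's
side, using the documented kvm-vfio pseudo device; vm_fd is the KVM VM
fd, the group number is just an example, and error handling is omitted:

	#include <linux/kvm.h>

	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);	/* cd.fd is the device fd */

	int32_t group_fd = open("/dev/vfio/26", O_RDWR);  /* example group */
	struct kvm_device_attr attr = {
		.group	= KVM_DEV_VFIO_GROUP,
		.attr	= KVM_DEV_VFIO_GROUP_ADD,
		.addr	= (uintptr_t)&group_fd,	/* proves group ownership */
	};
	ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);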

> Because KVM deduces some policy based on the fact of assigned device, 
> it needs to hold a reference to related vfio group. this part is irrelevant
> to this RFC. 

All of these use cases are related to the IOMMU, whether DMA is
coherent, translating device IOVA to GPA, and an acceleration path to
emulate IOMMU programming in kernel... they seem pretty relevant.

> But ARM's VMID usage is related to I/O address space thus needs some
> consideration. Another strange thing is about PPC. Looks it also leverages
> this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> group. I don't know why it's done through KVM instead of VFIO uAPI in
> the first place.

AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
to know how to handle those for in-kernel acceleration.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 20:28     ` Jason Gunthorpe
@ 2021-06-02  1:25       ` Tian, Kevin
  2021-06-02 23:27         ` Jason Gunthorpe
  2021-06-04  8:17         ` Jean-Philippe Brucker
  2021-06-02  8:52       ` Jason Wang
  1 sibling, 2 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-02  1:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 2, 2021 4:29 AM
> 
> On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 4:03 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > /dev/ioasid provides an unified interface for managing I/O page tables
> for
> > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > vDPA,
> > > > etc.) are expected to use this interface instead of creating their own
> logic to
> > > > isolate untrusted device DMAs initiated by userspace.
> > >
> > > It is very long, but I think this has turned out quite well. It
> > > certainly matches the basic sketch I had in my head when we were
> > > talking about how to create vDPA devices a few years ago.
> > >
> > > When you get down to the operations they all seem pretty common
> sense
> > > and straightfoward. Create an IOASID. Connect to a device. Fill the
> > > IOASID with pages somehow. Worry about PASID labeling.
> > >
> > > It really is critical to get all the vendor IOMMU people to go over it
> > > and see how their HW features map into this.
> > >
> >
> > Agree. btw I feel it might be good to have several design opens
> > centrally discussed after going through all the comments. Otherwise
> > they may be buried in different sub-threads and potentially with
> > insufficient care (especially for people who haven't completed the
> > reading).
> >
> > I summarized five opens here, about:
> >
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> >
> > For 1), two alternative names are mentioned: /dev/iommu and
> > /dev/ioas. I don't have a strong preference and would like to hear
> > votes from all stakeholders. /dev/iommu is slightly better imho for
> > two reasons. First, per AMD's presentation in last KVM forum they
> > implement vIOMMU in hardware thus need to support user-managed
> > domains. An iommu uAPI notation might make more sense moving
> > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > IOASID_ALLOC_IOASID. :)
> 
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time. We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
> 
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
> 
> Then we don't have naming aliasing with kernel constructs.
> 
> > For 2), Jason prefers to not blocking it if no kernel design reason. If
> > one device is allowed to bind multiple IOASID fd's, the main problem
> > is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
> > and giova_ioasid created in fd2 and then nesting them together (and
> 
> Huh? This can't happen
> 
> Creating an IOASID is an operation on on the /dev/ioasid FD. We won't
> provide APIs to create a tree of IOASID's outside a single FD container.
> 
> If a device can consume multiple IOASID's it doesn't care how many or
> what /dev/ioasid FDs they come from.

OK, this implies that if a user inadvertently creates an intended parent/
child via different fd's then the operation will simply fail. More specifically,
take ARM's case as an example. There is only a single 2nd-level I/O page
table per device (nested by multiple 1st-level tables). Say the user already
created a gpa_ioasid for a device via fd1. Now he binds the device to fd2,
intending to enable vSVA, which requires nested translation and thus needs
a parent created via fd2. This parent creation will simply be rejected by the
IOMMU layer because the 2nd-level (via fd1) is already installed for this device.
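
Spelled out with the RFC's ioctls (arguments elided, and assuming
multi-FD binding were allowed in the first place), the failing
sequence would look something like:

	gpa_ioasid = ioctl(fd1, IOASID_ALLOC);
	ioctl(device_fd, VFIO_BIND_IOASID_FD, &fd1);
	ioctl(device_fd, VFIO_ATTACH_IOASID, &gpa_ioasid);
				/* installs the device's only 2nd-level table */

	ioctl(device_fd, VFIO_BIND_IOASID_FD, &fd2);
	parent = ioctl(fd2, IOASID_ALLOC);
	ioctl(device_fd, VFIO_ATTACH_IOASID, &parent);
				/* fails (-EBUSY or similar): the 2nd-level
				 * slot is already taken via fd1 */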

> 
> > To the other end there was also thought whether we should make
> > a single I/O address space per IOASID fd. This was discussed in previous
> > thread that #fd's are insufficient to afford theoretical 1M's address
> > spaces per device. But let's have another revisit and draw a clear
> > conclusion whether this option is viable.
> 
> I had remarks on this, I think per-fd doesn't work
> 
> > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > specify a device label. This label will be recorded in /dev/iommu to
> > serve per-device invalidation request from and report per-device
> > fault data to the user.
> 
> I wonder which of the user providing a 64 bit cookie or the kernel
> returning a small IDA is the best choice here? Both have merits
> depending on what qemu needs..

Yes, either way can work. I don't have a strong preference. Jean?

> 
> > In addition, vPASID (if provided by user) will
> > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > is conducted properly. e.g. invalidation request from user carries
> > a vPASID which must be converted into pPASID before calling iommu
> > driver. Vice versa for raw fault data which carries pPASID while the
> > user expects a vPASID.
> 
> I don't think the PASID should be returned at all. It should return
> the IOASID number in the FD and/or a u64 cookie associated with that
> IOASID. Userspace should figure out what the IOASID & device
> combination means.

This is true for Intel. But what about ARM, which has only one IOASID
(the PASID table) per device to represent all guest I/O page tables?

> 
> > Seems to close this design open we have to touch the kAPI design. and
> > Joerg's input is highly appreciated here.
> 
> uAPI is forever, the kAPI is constantly changing. I always dislike
> warping the uAPI based on the current kAPI situation.
> 

I got this point. My point was that I didn't see a significant gain from
either option, so to better compare the two uAPI options we might want
to further consider the involved kAPI effort as another factor.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:42     ` Jason Gunthorpe
@ 2021-06-02  1:33       ` Tian, Kevin
  2021-06-02 16:09         ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-02  1:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 2, 2021 1:42 AM
> 
> On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 1:36 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > software nesting. With hardware support the child and parent I/O page
> > > > tables are walked consecutively by the IOMMU to form a nested
> translation.
> > > > When it's implemented in software, the ioasid driver is responsible for
> > > > merging the two-level mappings into a single-level shadow I/O page
> table.
> > > > Software nesting requires both child/parent page tables operated
> through
> > > > the dma mapping protocol, so any change in either level can be
> captured
> > > > by the kernel to update the corresponding shadow mapping.
> > >
> > > Why? A SW emulation could do this synchronization during invalidation
> > > processing if invalidation contained an IOVA range.
> >
> > In this proposal we differentiate between host-managed and user-
> > managed I/O page tables. If host-managed, the user is expected to use
> > map/unmap cmd explicitly upon any change required on the page table.
> > If user-managed, the user first binds its page table to the IOMMU and
> > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > not required when changing a PTE from non-present to present).
> >
> > We expect user to use map+unmap and bind+invalidate respectively
> > instead of mixing them together. Following this policy, map+unmap
> > must be used in both levels for software nesting, so changes in either
> > level are captured timely to synchronize the shadow mapping.
> 
> map+unmap or bind+invalidate is a policy of the IOASID itself set when
> it is created. If you put two different types in a tree then each IOASID
> must continue to use its own operation mode.
> 
> I don't see a reason to force all IOASIDs in a tree to be consistent??

Only for software nesting. With hardware support the parent uses map
while the child uses bind.

Yes, the policy is specified per IOASID. But if the policy violates the
requirement of a specific nesting mode, then nesting should fail.

> 
> A software emulated two level page table where the leaf level is a
> bound page table in guest memory should continue to use
> bind/invalidate to maintain the guest page table IOASID even though it
> is a SW construct.

With software nesting the leaf should be a host-managed page table
(or metadata). A bind/invalidate protocol doesn't require the user
to notify the kernel of every page table change, but for software nesting
the kernel must know every change to update the shadow/merged mapping
in time; otherwise DMA may hit a stale mapping.
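
In other words, as a sketch using the RFC's ioctl names with arguments
elided:

	/* Software nesting: both levels go through the mapping protocol,
	 * so the kernel observes every change and can refresh the shadow.
	 */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &parent_map);	/* GPA -> HPA */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &child_map);	/* gIOVA -> GPA; the
							 * kernel merges into a
							 * shadow gIOVA -> HPA */

	/* Hardware nesting: the child is a bound guest page table instead. */
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_child);
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv);  /* on guest updates */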

> 
> The GPA level should use map/unmap because it is a kernel owned page
> table

Yes, this is always true.

> 
> Though how to efficiently mix map/unmap on the GPA when there are SW
> nested levels below it looks to be quite challenging.
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:56     ` Jason Gunthorpe
@ 2021-06-02  2:00       ` Tian, Kevin
  2021-06-02  6:57       ` David Gibson
  1 sibling, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-02  2:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 2, 2021 1:57 AM
> 
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use
> cases:
> > > >
> > > > 	ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-
> virtualization
> > > > scenario.
> > >
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> >
> > Jason, want to confirm here. Per earlier discussion we remain an
> > impression that you want VFIO to be a pure device driver thus
> > container/group are used only for legacy application.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 

OK, got your point. So let's start by keeping this room. New
sub-systems like vDPA don't need to invent a group fd uAPI and can
just leave it to their users to meet the group limitation. An existing
sub-system, i.e. VFIO, could keep a stronger group enforcement
uAPI like today. One day we may revisit it, if the simple policy works
well for all other new sub-systems.

> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

Yes, this is guaranteed. /dev/ioasid uAPI has no group concept.

> 
> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.
> 
> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

yes

> 
> I didn't try to chart this out carefully.
> 
> Also, ultimately, something need to be done about compatability with
> the vfio container fd. It looks clear enough to me that the the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite rasonable to harmonize these somehow.

Possibly multiple IOASIDs, as a VFIO container can hold incompatible
devices today. Suppose helper functions will be provided for the VFIO
container to create IOASIDs and then use map/unmap to manage their I/O
page tables. This is the shim iommu driver concept from the previous
discussion between you and Alex.

This can be done at a later stage. Let's focus on the /dev/ioasid uAPI,
and bear some code duplication between it and vfio type1 for now.

> 
> But that is too complicated and far out for me at least to guess on at
> this point..

We're working on a prototype in parallel with this discussion. Based on
this work we'll figure out the best way to start.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 22:22     ` Alex Williamson
@ 2021-06-02  2:20       ` Tian, Kevin
  2021-06-02 16:01         ` Jason Gunthorpe
  2021-06-08  2:37       ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-02  2:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok,
	Liu, Yi L, Wu, Hao, Jiang, Dave, Jacob Pan,
	Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, June 2, 2021 6:22 AM
> 
> On Tue, 1 Jun 2021 07:01:57 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > I summarized five opens here, about:
> >
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> >
> ...
> >
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> 
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is

invlpg -> wbinvd :)

> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.

So the symbol_get thing is not about the protocol itself. Whatever protocol
is defined, as long as kvm needs to call a vfio or ioasid helper function, we
need to define a proper way to do it. Jason, what's your opinion on an
alternative option, since you dislike symbol_get?

> 
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
> 
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.
> 
> > Because KVM deduces some policy based on the fact of assigned device,
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC.
> 
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.

One open is whether kvm should hold a device reference or an IOASID
reference. For DMA coherence, it only matters whether the assigned
devices are coherent or not (not for a specific address space). For kvmgt,
it is for recording the kvm pointer in the mdev driver to do write protection.
For ppc, it does relate to a specific I/O page table.

Then I feel only a part of the protocol will be moved to /dev/ioasid and
something will still remain between kvm and vfio?

> 
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. Looks it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
> 
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM
> needs
> to know how to handle those for in-kernel acceleration.  Thanks,
> 

ok.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:26     ` Jason Gunthorpe
@ 2021-06-02  4:01       ` Lu Baolu
  2021-06-02 23:23         ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Lu Baolu @ 2021-06-02  4:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Tian, Kevin, LKML, Joerg Roedel, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> 
>> This version only covers 1) and 4). Do you think we need to support 2),
>> 3) and beyond?
> 
> Yes aboslutely. The API should be flexable enough to specify the
> creation of all future page table formats we'd want to have and all HW
> specific details on those formats.

OK, so we're on the same page here.

>> If so, it seems that we need some in-kernel helpers and uAPIs to
>> support pre-installing a page table to IOASID.
> 
> Not sure what this means..

Sorry that I didn't make this clear.

Let me restate the page table types as I see them.

  1) IOMMU format page table (a.k.a. iommu_domain)
  2) user application CPU page table (SVA for example)
  3) KVM EPT (future option)
  4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with an IOASID.
We have a BIND protocol for 4); we explicitly allocate an iommu_domain for
1). But we don't have a clear definition for 2), 3) and others. I think
it's necessary to clearly define a time point and kAPI between
IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
opportunity to associate their page table with the allocated IOASID
before the page table is attached to the real IOMMU hardware.

I/O page fault handling is similar. The provider of the page table
should take the responsibility to handle the possible page faults.
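
To make the idea concrete, a purely hypothetical kAPI shape for that
window between IOASID_ALLOC and attach, reusing the ioasid_ctx and
ioasid_data structures from section 3 of the RFC; all function and
member names below are invented, only to illustrate the hook:

	/* Hypothetical kAPI -- names invented for illustration only. */
	struct ioasid_pgtable_ops {
		/* called when the IOASID gets attached to real IOMMU hw */
		int (*install)(struct ioasid_data *ioasid, struct device *dev);
		/* per-IOASID I/O page faults are routed to the provider */
		int (*handle_fault)(struct ioasid_data *ioasid,
				    struct iommu_fault *fault);
	};

	int ioasid_set_pgtable_provider(struct ioasid_ctx *ctx, u32 ioasid,
					const struct ioasid_pgtable_ops *ops,
					void *pgtable);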

Could this answer the question of "I'm still confused why drivers need
fault handlers at all?" in below thread?

https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#m15def9e8b236dfcf97e21c8e9f8a58da214e3691

> 
>>  From this point of view an IOASID is actually not just a variant of
>> iommu_domain, but an I/O page table representation in a broader
>> sense.
> 
> Yes, and things need to evolve in a staged way. The ioctl API should
> have room for this growth but you need to start out with some
> constrained enough to actually implement then figure out how to grow
> from there

Yes, agreed. I just think about it from the perspective of a design
document.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:33         ` Jason Gunthorpe
@ 2021-06-02  4:50           ` Shenming Lu
  2021-06-03 18:19             ` Jacob Pan
  0 siblings, 1 reply; 258+ messages in thread
From: Shenming Lu @ 2021-06-02  4:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Lu Baolu
  Cc: Tian, Kevin, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy, Zenghui Yu,
	wanghaibin.wang

On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> 
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults. 
> 
> I'm still confused why drivers need fault handlers at all?

Essentially it is userspace that needs the fault handlers: one case
is to deliver the faults to the vIOMMU, and another case is to enable
IOPF on the GPA address space for on-demand paging. It seems that
both could be specified in/through the IOASID_ALLOC ioctl?
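
One hypothetical way to express that at allocation time, just to
illustrate the idea; the flag names are invented, not proposed:

	/* Hypothetical IOASID_ALLOC flags -- invented for illustration. */
	#define IOASID_ALLOC_FLAG_USER_FAULT	(1 << 0)  /* deliver faults to
							   * userspace / vIOMMU */
	#define IOASID_ALLOC_FLAG_KERNEL_IOPF	(1 << 1)  /* kernel handles IOPF
							   * on this (GPA) space
							   * for demand paging */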

Thanks,
Shenming


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (8 preceding siblings ...)
  2021-06-01 17:30 ` Parav Pandit
@ 2021-06-02  6:15 ` David Gibson
  2021-06-02 17:19   ` Jason Gunthorpe
                     ` (2 more replies)
  2021-06-02  8:56 ` Enrico Weigelt, metux IT consult
  10 siblings, 3 replies; 258+ messages in thread
From: David Gibson @ 2021-06-02  6:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
> 	https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the writeup.  I'm giving this a first-pass review; note
that I haven't read all the existing replies in detail yet.

> 
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
>     2.1. /dev/ioasid uAPI
>     2.2. /dev/vfio uAPI
>     2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
>     5.1. A simple example
>     5.2. Multiple IOASIDs (no nesting)
>     5.3. IOASID nesting (software)
>     5.4. IOASID nesting (hardware)
>     5.5. Guest SVA (vSVA)
>     5.6. I/O page fault
>     5.7. BIND_PASID_TABLE
> ====
> 
> 1. Terminologies and Concepts
> -----------------------------------------
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).

Is there a compelling reason to have all the IOASIDs handled by one
FD?  Simply on the grounds that handles to kernel internal objects are
usually fds, having an fd per ioasid seems like an obvious alternative.
In that case plain open() would replace IOASID_ALLOC.  Nesting could be
handled either by 1) having a CREATE_NESTED on the parent fd which
spawns a new fd or 2) opening /dev/ioasid again for a new fd and doing
a SET_PARENT before doing anything else.

I may be bikeshedding here..
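
For comparison, the two variants would look roughly like this; the
ioctl names are hypothetical, riffing on the suggestion above:

	/* fd-per-IOASID, variant 1: spawn the child fd from the parent fd */
	int parent = open("/dev/ioasid", O_RDWR);
	int child  = ioctl(parent, IOASID_CREATE_NESTED);  /* returns new fd */

	/* variant 2: open a fresh fd and link it to its parent first */
	int child2 = open("/dev/ioasid", O_RDWR);
	ioctl(child2, IOASID_SET_PARENT, parent);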

> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind an user-managed I/O page table to the 
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASID's, implying 
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.

To clarify, I'm guessing that's a restriction of likely practice,
rather than a fundamental API restriction.  I can see a couple of
theoretical future cases where a user-managed pagetable for a "base"
IOASID would be feasible:

  1) On some fancy future MMU allowing free nesting, where the kernel
     would insert an implicit extra layer translating user addresses
     to physical addresses, and the userspace manages a pagetable with
     its own VAs being the target AS
  2) For a purely software virtual device, where its virtual DMA
     engine can interpret user addresses fine

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

As Jason also said, I don't think you need to restrict software
nesting to only kernel managed L2 tables - you already need hooks for
cache invalidation, and you can use those to trigger reshadows.

> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/ioasid context always refers to a 
> physical one or 'pdev' (PF or VF). 

What you mean by "physical" device here isn't really clear - VFs
aren't really physical devices, and the PF/VF terminology also doesn't
extend to non-PCI devices (which I think we want to consider for the
API, even if we're not implementing it any time soon).

Now, it's clear that we can't program things into the IOMMU before
attaching a device - we might not even know which IOMMU to use.
However, I'm not sure if its wise to automatically make the AS "real"
as soon as we attach a device:

 * If we're going to attach a whole bunch of devices, could we (for at
   least some IOMMU models) end up doing a lot of work which then has
   to be re-done for each extra device we attach?
   
 * With kernel managed IO page tables could attaching a second device
   (at least on some IOMMU models) require some operation which would
   require discarding those tables?  e.g. if the second device somehow
   forces a different IO page size

For that reason I wonder if we want some sort of explicit enable or
activate call.  Device attaches would only be valid before, map or
attach pagetable calls would only be valid after.
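
Something like the following, where IOASID_ENABLE is hypothetical and
only shows where the cut would be:

	ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	ioctl(device1_fd, VFIO_ATTACH_IOASID, &ioasid);	/* attach phase...  */
	ioctl(device2_fd, VFIO_ATTACH_IOASID, &ioasid);	/* ...more attaches */
	ioctl(ioasid_fd, IOASID_ENABLE, &ioasid);	/* hypothetical: freeze
							 * the config, program
							 * the IOMMU */
	/* from here on, IOASID_MAP_DMA / IOASID_BIND_PGTABLE become valid
	 * and further device attaches do not */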

> One I/O address space could be attached to multiple devices. In this case, 
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> 
> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I'm not really clear on how this interacts with nested ioasids.  Would
you generally expect the RID+PASID IOASes to be children of the base
RID IOAS, or not?

If the PASID ASes are children of the RID AS, can we consider this not
as the device explicitly attaching to multiple IOASIDs, but instead
attaching to the parent IOASID with awareness of the child ones?

> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying 
> the routing information and registering it to the ioasid driver when calling 
> ioasid attach helper function. It could be RID if the assigned device is 
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
> user might also provide its view of virtual routing information (vPASID) in 
> the attach call, e.g. when multiple user-managed I/O address spaces are 
> attached to the vfio_device. In this case VFIO must figure out whether 
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
> 
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device 
> should not be bound to multiple FD's. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have 
> different view we can further discuss.
> 
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is 
> directly programmed to the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to pPASID before sent to the 
> device. A detail explanation about PASID virtualization policies can be 
> found in section 4. 
> 
> Modern devices may support a scalable workload submission interface 
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having 
> PASID saved in the CPU MSR and carried in the instruction payload 
> when sent out to the device. Then a single work queue shared by 
> multiple processes can compose DMAs carrying different PASIDs. 

Is the assumption here that the processes share the IOASID FD
instance, but not memory?

> When executing ENQCMD in the guest, the CPU MSR includes a vPASID 
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability 
> for auto-conversion in the fast path. The user is expected to setup the 
> PASID mapping through KVM uAPI, with information about {vpasid, 
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM 
> to figure out the actual pPASID given an IOASID.
> 
> With above design /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For 
> example, I/O page fault is always reported to userspace per IOASID, 
> although it's physically reported per device (RID+PASID). If there is a 
> need of further relaying this fault into the guest, the user is responsible 
> of identifying the device attached to this IOASID (randomly pick one if 
> multiple attached devices) and then generates a per-device virtual I/O 
> page fault into guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID, 
> representing the per-RID vPASID space.

Do we need to consider two management modes here, much as we have for
the pagetables themselves: either kernel managed, in which case we have
explicit calls to bind a vPASID to a parent PASID, or user managed, in
which case we register a table in some format?

> We propose the host kernel needs to explicitly track  guest I/O page 
> tables even on these platforms, i.e. the same pgtable binding protocol 
> should be used universally on all platforms (with only difference on who 
> actually writes the PASID table). One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. However this way significantly 
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> one address space any more. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> page faulting uAPI to help connect vIOMMU with the underlying 
> pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for 
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device. 
> 
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no 
> device notation in this interface as aforementioned. But the ioasid driver 
> does implicit check to make sure that devices within an iommu group 
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to 
> the user.

An explicit ENABLE call might make this checking simpler.

> There was a long debate in previous discussion whether VFIO should keep 
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes 
> a simplified model where every device bound to VFIO is explicitly listed 
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for 
> understanding the group topology and meeting the implicit group check 
> criteria enforced in /dev/ioasid. The use case examples in this proposal 
> are based on the new model.
> 
> Of course for backward compatibility VFIO still needs to keep the existing 
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO 
> iommu ops to internal ioasid helper functions.
> 
> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     find a better name later to differentiate.
> 
> -   PPC has not be considered yet as we haven't got time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between PPC window-based scheme and VFIO type1 semantics. Let's 
>     first make consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirement.

From what I've seen so far, it seems ok to me.  Note that at this
stage I'm only familiar with existing PPC IOMMUs, which don't have
PASID or anything similar.  I'm not sure what IBM's future plans are
for IOMMUs, so there will be more checking to be done.

> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

I think that's only used on PPC, as an optimization for PAPR's
paravirt IOMMU with a small default IOVA window.  I think we can do
something equivalent for IOASIDs from what I've seen so far.

> -   mdev in this context refers to mediated subfunctions (e.g. Intel SIOV) 
>     which can be physically isolated in-between through PASID-granular
>     IOMMU protection. Historically people also discussed one usage by 
>     mediating a pdev into a mdev. This usage is not covered here, and is 
>     supposed to be replaced by Max's work which allows overriding various 
>     VFIO operations in vfio-pci driver.

I think there are a couple of different mdev cases, so we'll need to
be careful of that and clarify our terminology a bit.

> 2. uAPI Proposal
> ----------------------
> 
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
> 
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
> 
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
> 
> 
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> 
> /*
>   * Check whether an uAPI extension is supported. 
>   *
>   * This is for FD-level capabilities, such as locked page pre-registration. 
>   * IOASID-level capabilities are reported through IOASID_GET_INFO.
>   *
>   * Return: 0 if not supported, 1 if supported.
>   */
> #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
> 
> /*
>   * Register user space memory where DMA is allowed.
>   *
>   * It pins user pages and does the locked memory accounting so sub-
>   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
>   *
>   * When this ioctl is not used, one user page might be accounted
>   * multiple times when it is mapped by multiple IOASIDs which are
>   * not nested together.
>   *
>   * Input parameters:
>   *	- vaddr;
>   *	- size;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)

AIUI PPC is the main user of the current pre-registration API, though
it could have value in any vIOMMU case to avoid possibly costly
accounting on every guest map/unmap.

I wonder if there's a way to model this using a nested AS rather than
requiring special operations.  e.g.

	'prereg' IOAS
	|
	\- 'rid' IOAS
	   |
	   \- 'pasid' IOAS (maybe)

'prereg' would have a kernel managed pagetable into which (for
example) qemu platform code would map all guest memory (using
IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
IO mappings into the 'rid' IOAS in terms of GPA.

This wouldn't quite work as is, because the 'prereg' IOAS would have
no devices.  But we could potentially have another call to mark an
IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
would be an alternative to attaching devices.
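
In that model, qemu's usage might look like the sketch below, using
the RFC's ioctls with arguments elided; the "pure prereg" marking is
the hypothetical extra call mentioned above:

	prereg = ioctl(ioasid_fd, IOASID_ALLOC);
	ioctl(ioasid_fd, IOASID_SET_PREREG, &prereg);	/* hypothetical: mark
							 * as a device-less,
							 * pure prereg IOAS */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &all_guest_ram); /* pin/account once */

	rid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, &prereg);
	ioctl(device_fd, VFIO_ATTACH_IOASID, &rid);
	/* vIOMMU mirror: map guest IOVA -> GPA into 'rid' as the guest
	 * programs its IOMMU; the costly accounting already happened above */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &giova_to_gpa);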

> /*
>   * Allocate an IOASID. 
>   *
>   * IOASID is the FD-local software handle representing an I/O address 
>   * space. Each IOASID is associated with a single I/O page table. User 
>   * must call this ioctl to get an IOASID for every I/O address space that is
>   * intended to be enabled in the IOMMU.
>   *
>   * A newly-created IOASID doesn't accept any command before it is 
>   * attached to a device. Once attached, an empty I/O page table is 
>   * bound with the IOMMU then the user could use either DMA mapping 
>   * or pgtable binding commands to manage this I/O page table.
>   *
>   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
>   *
>   * Return: allocated ioasid on success, -errno on failure.
>   */
> #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> 
> /*
>   * Get information about an I/O address space
>   *
>   * Supported capabilities:
>   *	- VFIO type1 map/unmap;
>   *	- pgtable/pasid_table binding
>   *	- hardware nesting vs. software nesting;
>   *	- ...
>   *
>   * Related attributes:
>   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);

Can I request we represent this in terms of permitted IOVA ranges,
rather than reserved IOVA ranges?  This works better with the "window"
model I have in mind for unifying the restrictions of the POWER IOMMU
with Type1-like mapping.
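
For instance, the info could carry an array of allowed windows rather
than a list of holes; the struct below is invented only to illustrate:

	/* Hypothetical fragment of the IOASID_GET_INFO output, expressed
	 * as permitted windows instead of reserved holes.
	 */
	struct ioasid_iova_range {
		__u64	start;
		__u64	last;	/* inclusive */
	};
	/* ...the info struct would then carry nr_iova_ranges plus a flex
	 * array of ioasid_iova_range, and IOASID_MAP_DMA would only accept
	 * IOVAs falling inside one of them. */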

>   *	- vendor pgtable formats (pgtable binding);
>   *	- number of child IOASIDs (nesting);
>   *	- ...
>   *
>   * Above information is available only after one or more devices are
>   * attached to the specified IOASID. Otherwise the IOASID is just a
>   * number w/o any capability or attribute.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *
>   * Output parameters:
>   *	- many. TBD.
>   */
> #define IOASID_GET_INFO	_IO(IOASID_TYPE, IOASID_BASE + 5)
> 
> 
> /*
>   * Map/unmap process virtual addresses to I/O virtual addresses.
>   *
>   * Provide VFIO type1 equivalent semantics. Start with the same 
>   * restriction e.g. the unmap size should match those used in the 
>   * original mapping call. 
>   *
>   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
>   * must be already in the preregistered list.
>   *
>   * Input parameters:
>   *	- u32 ioasid;
>   *	- refer to vfio_iommu_type1_dma_{un}map
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)

I'm assuming these would be expected to fail if a user managed
pagetable has been bound?

> /*
>   * Create a nesting IOASID (child) on an existing IOASID (parent)
>   *
>   * IOASIDs can be nested together, implying that the output address 
>   * from one I/O page table (child) must be further translated by 
>   * another I/O page table (parent).
>   *
>   * As the child adds essentially another reference to the I/O page table 
>   * represented by the parent, any device attached to the child ioasid 
>   * must be already attached to the parent.
>   *
>   * In concept there is no limit on the number of the nesting levels. 
>   * However for the majority case one nesting level is sufficient. The
>   * user should check whether an IOASID supports nesting through 
>   * IOASID_GET_INFO. For example, if only one nesting level is allowed,
>   * the nesting capability is reported only on the parent instead of the
>   * child.
>   *
>   * User also needs check (via IOASID_GET_INFO) whether the nesting 
>   * is implemented in hardware or software. If software-based, DMA 
>   * mapping protocol should be used on the child IOASID. Otherwise, 
>   * the child should be operated with pgtable binding protocol.
>   *
>   * Input parameters:
>   *	- u32 parent_ioasid;
>   *
>   * Return: child_ioasid on success, -errno on failure;
>   */
> #define IOASID_CREATE_NESTING	_IO(IOASID_TYPE, IOASID_BASE + 8)
> 
> 
> /*
>   * Bind an user-managed I/O page table with the IOMMU
>   *
>   * Because user page table is untrusted, IOASID nesting must be enabled 
>   * for this ioasid so the kernel can enforce its DMA isolation policy 
>   * through the parent ioasid.
>   *
>   * Pgtable binding protocol is different from DMA mapping. The latter 
>   * has the I/O page table constructed by the kernel and updated 
>   * according to user MAP/UNMAP commands. With pgtable binding the 
>   * whole page table is created and updated by userspace, thus different 
>   * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>   *
>   * Because the page table is directly walked by the IOMMU, the user 
>   * must  use a format compatible to the underlying hardware. It can 
>   * check the format information through IOASID_GET_INFO.
>   *
>   * The page table is bound to the IOMMU according to the routing 
>   * information of each attached device under the specified IOASID. The
>   * routing information (RID and optional PASID) is registered when a 
>   * device is attached to this IOASID through VFIO uAPI. 
>   *
>   * Input parameters:
>   *	- child_ioasid;
>   *	- address of the user page table;
>   *	- formats (vendor, address_width, etc.);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)

I'm assuming that UNBIND would return the IOASID to a kernel-managed
pagetable?

For debugging and certain hypervisor edge cases it might be useful to
have a call to allow userspace to look up a specific IOVA in a
guest-managed pgtable.
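
Something along these lines, purely as a sketch; the name, layout and
ioctl number are invented here, not part of the proposal:

	/* Hypothetical debug/lookup call. */
	struct ioasid_lookup_iova {
		__u32	ioasid;
		__u32	flags;
		__u64	iova;		/* in: address to translate */
		__u64	out_addr;	/* out: output/next-level address */
		__u32	out_perm;	/* out: read/write permissions */
	};
	#define IOASID_LOOKUP_IOVA	_IO(IOASID_TYPE, IOASID_BASE + 14)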


> /*
>   * Bind an user-managed PASID table to the IOMMU
>   *
>   * This is required for platforms which place PASID table in the GPA space.
>   * In this case the specified IOASID represents the per-RID PASID space.
>   *
>   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
>   * special flag to indicate the difference from normal I/O address spaces.
>   *
>   * The format info of the PASID table is reported in IOASID_GET_INFO.
>   *
>   * As explained in the design section, user-managed I/O page tables must
>   * be explicitly bound to the kernel even on these platforms. It allows
>   * the kernel to uniformly manage I/O address spaces cross all platforms.
>   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
>   * to carry device routing information to indirectly mark the hidden I/O
>   * address spaces.
>   *
>   * Input parameters:
>   *	- child_ioasid;

Wouldn't this be the parent ioasid, rather than one of the potentially
many child ioasids?

>   *	- address of PASID table;
>   *	- formats (vendor, size, etc.);
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define IOASID_BIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 12)
> 
> 
> /*
>   * Invalidate IOTLB for an user-managed I/O page table
>   *
>   * Unlike what's defined in include/uapi/linux/iommu.h, this command 
>   * doesn't allow the user to specify cache type and likely support only
>   * two granularities (all, or a specified range) in the I/O address space.
>   *
>   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
>   * cache). If the IOASID represents an I/O address space, the invalidation
>   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
>   * represents a vPASID space, then this command applies to the PASID
>   * cache.
>   *
>   * Similarly this command doesn't provide IOMMU-like granularity
>   * info (domain-wide, pasid-wide, range-based), since it's all about the
>   * I/O address space itself. The ioasid driver walks the attached
>   * routing information to match the IOMMU semantics under the
>   * hood. 
>   *
>   * Input parameters:
>   *	- child_ioasid;

And couldn't this be any ioasid, not just a child one, depending on
whether you want PASID-scope or RID-scope invalidation?

>   *	- granularity
>   * 
>   * Return: 0 on success, -errno on failure
>   */
> #define IOASID_INVALIDATE_CACHE	_IO(IOASID_TYPE, IOASID_BASE + 13)
> 
> 
> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */
> 
> 
> /*
>   * Dirty page tracking 
>   *
>   * Track and report memory pages dirtied in I/O address spaces. There 
>   * is an ongoing work by Kunkun Jiang by extending existing VFIO type1. 
>   * It needs be adapted to /dev/ioasid later.
>   */
> 
> 
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
> 
> /*
>   * Bind a vfio_device to the specified IOASID fd
>   *
>   * Multiple vfio devices can be bound to a single ioasid_fd, but a single 
>   * vfio device should not be bound to multiple ioasid_fd's. 
>   *
>   * Input parameters:
>   *	- ioasid_fd;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOASID_FD		_IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD	_IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> 
> /*
>   * Attach a vfio device to the specified IOASID
>   *
>   * Multiple vfio devices can be attached to the same IOASID, and vice 
>   * versa. 
>   *
>   * User may optionally provide a "virtual PASID" to mark an I/O page
>   * table on this vfio device. Whether the virtual PASID is physically used
>   * or converted to another kernel-allocated PASID is a policy decision of
>   * the vfio device driver.
>   *
>   * There is no need to specify ioasid_fd in this call due to the assumption
>   * of a 1:1 connection between the vfio device and the bound fd.
>   *
>   * Input parameter:
>   *	- ioasid;
>   *	- flag;
>   *	- user_pasid (if specified);

Wouldn't the PASID be communicated by whether you give a parent or
child ioasid, rather than needing an extra value?

>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_ATTACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID		_IO(VFIO_TYPE, VFIO_BASE + 25)
> 
> 
> 2.3. KVM uAPI
> ++++++++++++
> 
> /*
>   * Update CPU PASID mapping
>   *
>   * This is necessary when ENQCMD will be used in the guest while the
>   * targeted device doesn't accept the vPASID saved in the CPU MSR.
>   *
>   * This command allows the user to set/clear the vPASID->pPASID mapping
>   * in the CPU, by providing the IOASID (and FD) information representing
>   * the I/O address space marked by this vPASID.
>   *
>   * Input parameters:
>   *	- user_pasid;
>   *	- ioasid_fd;
>   *	- ioasid;
>   */
> #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
> 
> 
> 3. Sample structures and helper functions
> --------------------------------------------------------
> 
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> 
> 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> 	int ioasid_unregister_device(struct ioasid_dev *dev);
> 
> An ioasid_ctx is created for each fd:
> 
> 	struct ioasid_ctx {
> 		// a list of allocated IOASID data's
> 		struct list_head		ioasid_list;
> 		// a list of registered devices
> 		struct list_head		dev_list;
> 		// a list of pre-registered virtual address ranges
> 		struct list_head		prereg_list;
> 	};
> 
> Each registered device is represented by ioasid_dev:
> 
> 	struct ioasid_dev {
> 		struct list_head		next;
> 		struct ioasid_ctx	*ctx;
> 		// always be the physical device

Again "physical" isn't really clearly defined here.

> 		struct device 		*device;
> 		struct kref		kref;
> 	};
> 
> Because we assume one vfio_device is connected to at most one ioasid_fd,
> ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. The PASID marking this
> mdev is specified later via VFIO_ATTACH_IOASID.
> 
> An ioasid_data is created when IOASID_ALLOC, as the main object 
> describing characteristics about an I/O page table:
> 
> 	struct ioasid_data {
> 		// link to ioasid_ctx->ioasid_list
> 		struct list_head		next;
> 
> 		// the IOASID number
> 		u32			ioasid;
> 
> 		// the handle to convey iommu operations
> 		// hold the pgd (TBD until discussing iommu api)
> 		struct iommu_domain *domain;
> 
> 		// map metadata (vfio type1 semantics)
> 		struct rb_node		dma_list;

Why do you need this?  Can't you just store the kernel managed
mappings in the host IO pgtable?

> 		// pointer to user-managed pgtable (for nesting case)
> 		u64			user_pgd;

> 		// link to the parent ioasid (for nesting)
> 		struct ioasid_data	*parent;
> 
> 		// cache the global PASID shared by ENQCMD-capable
> 		// devices (see below explanation in section 4)
> 		u32			pasid;
> 
> 		// a list of device attach data (routing information)
> 		struct list_head		attach_data;
> 
> 		// a list of partially-attached devices (group)
> 		struct list_head		partial_devices;
> 
> 		// a list of fault_data reported from the iommu layer
> 		struct list_head		fault_data;
> 
> 		...
> 	}
> 
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be correlated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this open
> for now as this proposal is mainly about uAPI. For simplification
> purposes the two objects are kept separate in this context, assuming a
> 1:1 connection in-between and the domain as the place-holder
> representing the 1st-class object in the iommu ops.
> 
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
> 
> 	struct attach_info {
> 		u32	ioasid;
> 		// If valid, the PASID to be used physically
> 		u32	pasid;

Again shouldn't the choice of a parent or child ioasid inform whether
there is a pasid, and if so which one?

> 	};
> 	int ioasid_device_attach(struct ioasid_dev *dev, 
> 		struct attach_info info);
> 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> 
> The pasid parameter is optionally provided based on the policy in the vfio
> device driver. It could be the PASID marking the default I/O address
> space for an mdev, the user-provided PASID marking a user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check the next section for a detailed explanation.
> 
> A new object is introduced and linked to ioasid_data->attach_data for 
> each successful attach operation:
> 
> 	struct ioasid_attach_data {
> 		struct list_head		next;
> 		struct ioasid_dev	*dev;
> 		u32 			pasid;
> 	}
> 
> As explained in the design section, there is no explicit group enforcement
> in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
> implicit group check - until every device within an iommu group has been
> attached to this IOASID, the already-attached devices in this group are
> kept in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.
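
A minimal sketch of that implicit check, with made-up helper naming (the
in-kernel API is explicitly TBD in this proposal), could be:

	/* illustrative only; real helper names and locking are TBD */
	static int ioasid_check_no_partial_group(struct ioasid_data *data)
	{
		/* some device of an attached iommu group is still missing */
		if (!list_empty(&data->partial_devices))
			return -EBUSY;
		return 0;
	}

i.e. commands such as IOASID_DMA_MAP would keep failing until every device
of the group has been attached.
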
> 
> Then there is the last helper function:
> 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, 
> 		u32 ioasid, bool alloc);
> 
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share the same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in the next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). The vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to set up the PASID translation
> structure when the user calls KVM_MAP_PASID.
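
To make the alloc=true/false split concrete, here is a rough sketch of the
two call sites (the surrounding code and kvm_setup_pasid_translation() are
assumptions for illustration, not part of the proposal):

	/* vfio device driver, before attaching an ENQCMD-capable mdev */
	pasid = ioasid_get_global_pasid(ctx, ioasid, true);	/* allocate on first use */
	info = (struct attach_info){ .ioasid = ioasid, .pasid = pasid };
	ioasid_device_attach(idev, info);

	/* KVM, while handling KVM_MAP_PASID */
	pasid = ioasid_get_global_pasid(ctx, ioasid, false);	/* must already exist */
	kvm_setup_pasid_translation(kvm, vpasid, pasid);	/* made-up helper */
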
> 
> 4. PASID Virtualization
> ------------------------------
> 
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark a GVA address space while pPASID is the one
> selected by the host and actually routed on the wire.
> 
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
> 
> The vfio device driver translates vPASID to pPASID before calling
> ioasid_device_attach, with two factors to be considered:
> 
> -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or 
>      should be instead converted to a newly-allocated one (vPASID!=
>      pPASID);
> 
> -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
>      space or a global PASID space (implying sharing pPASID cross devices,
>      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
>      as part of the process context);
> 
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
> 
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization 
> policies.)
> 
> 1)  pdev (w/ or w/o ENQCMD): vPASID==pPASID
> 
>      vPASIDs are directly programmed by the guest to the assigned MMIO
>      bar, implying all DMAs out of this device carry the vPASID in the packet
>      header. This mandates vPASID==pPASID, sort of delegating the entire
>      per-RID PASID space to the guest.
> 
>      When ENQCMD is enabled, the CPU MSR when running a guest task
>      contains a vPASID. In this case the CPU PASID translation capability
>      should be disabled so this vPASID in the CPU MSR is directly sent to the
>      wire.
> 
>      This ensures consistent vPASID usage on a pdev regardless of whether the
>      workload is submitted through an MMIO register or the ENQCMD instruction.
> 
> 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> 
>      PASIDs are also used by the kernel to mark the default I/O address space
>      for an mdev, and thus cannot be delegated to the guest. Instead, the mdev
>      driver must allocate a new pPASID for each vPASID (thus vPASID!=
>      pPASID) and then use the pPASID when attaching this mdev to an ioasid.
> 
>      The mdev driver needs to cache the PASID mapping so that in the mediation
>      path the vPASID programmed by the guest can be converted to the pPASID
>      before updating the physical MMIO register. The mapping should
>      also be saved in the CPU PASID translation structure (via the KVM uAPI),
>      so the vPASID saved in the CPU MSR is auto-translated to the pPASID
>      before being sent to the wire, when ENQCMD is enabled.
> 
>      Generally pPASID could be allocated from the per-RID PASID space
>      if all mdev's created on the parent device don't support ENQCMD.
> 
>      However if the parent supports ENQCMD-capable mdevs, pPASIDs
>      must be allocated from a global pool because the CPU PASID
>      translation structure is per-VM. It implies that when a guest I/O
>      page table is attached to two mdevs with a single vPASID (i.e. bound
>      to the same guest process), the same pPASID should be used for
>      both mdevs even when they belong to different parents. Sharing a
>      pPASID across mdevs is achieved by calling the aforementioned ioasid_
>      get_global_pasid().
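
A rough sketch of that mediation-path conversion in an mdev driver (struct
my_mdev, mdev_lookup_ppasid() and PASID_REG are made-up names, purely for
illustration):

	static int mdev_mediate_pasid_write(struct my_mdev *m, u32 vpasid)
	{
		u32 ppasid;

		/* vPASID->pPASID mapping cached at VFIO_ATTACH_IOASID time */
		if (mdev_lookup_ppasid(m, vpasid, &ppasid))
			return -EINVAL;

		/* program the physical register with the pPASID, not the vPASID */
		writel(ppasid, m->mmio_base + PASID_REG);
		return 0;
	}
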
> 
> 3)  Mix pdev/mdev together
> 
>      The above policies are per device type and thus are not affected when mixing
>      those device types together (when assigned to a single guest). However,
>      there is one exception - when both pdev and mdev support ENQCMD.
> 
>      Remember the two types have conflicting requirements on whether
>      CPU PASID translation should be enabled. This capability is per-VM,
>      and must be enabled for mdev isolation. When enabled, a pdev will
>      receive an mdev pPASID, violating its vPASID expectation.
> 
>      In a previous thread a PASID range split scheme was discussed to support
>      this combination, but we haven't worked out a clean uAPI design yet.
>      Therefore in this proposal we decide not to support it, implying the
>      user should have some intelligence to avoid such a scenario. It could be
>      a TODO task for the future.
> 
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
> 
> -    v==p for pdev;
> -    v!=p and always use a global PASID pool for all mdev's;
> 
> Regardless of the kernel policy, the user policy is unchanged:
> 
> -    provide vPASID when calling VFIO_ATTACH_IOASID;
> -    call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> -    Don't expose ENQCMD capability on both pdev and mdev;
> 
> Sample user flow is described in section 5.5.
> 
> 5. Use Cases and Flows
> -------------------------------
> 
> Here we assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio, thus a device fd can be acquired without
> going through the legacy container/group interface. For illustration purposes
> those devices are just called dev[1...N]:
> 
> 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
for mdevs.  That leaves other subdirs of /dev/vfio free for future
non-PCI device types, and /dev/vfio itself for the legacy group
devices.

> As explained earlier, one IOASID fd is sufficient for all intended use cases:
> 
> 	ioasid_fd = open("/dev/ioasid", mode);
> 
> For simplicity the examples below are all made for the virtualization story.
> They are representative and could easily be adapted to a non-virtualization
> scenario.
> 
> Three types of IOASIDs are considered:
> 
> 	gpa_ioasid[1...N]: 	for GPA address space
> 	giova_ioasid[1...N]:	for guest IOVA address space
> 	gva_ioasid[1...N]:	for guest CPU VA address space
> 
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant only as far as a vIOMMU is concerned.
> 
> Examples here apply to both pdev and mdev, unless explicitly noted
> (e.g. in section 5.5). The VFIO device driver in the kernel will figure out the
> associated routing information during the attach operation.
> 
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
> 
> 5.1. A simple example
> ++++++++++++++++++
> 
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through the DMA mapping protocol:
> 
> 	/* Bind device to IOASID fd */
> 	device_fd = open("/dev/vfio/devices/dev1", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* Attach device to IOASID */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0;		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> If the guest is assigned more than dev1, the user follows the above sequence
> to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA
> address space across all assigned devices.
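
For instance, a second device could join the same GPA address space like
this (a sketch mirroring the steps above):

	/* Bind dev2 to the same IOASID fd and attach it to gpa_ioasid */
	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
	at_data = { .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
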
> 
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
> 
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid.

Doesn't really affect your example, but note that the PAPR IOMMU does
not have a passthrough mode, so devices will not initially be attached
to gpa_ioasid - they will be unusable for DMA until attached to a
gIOVA ioasid.

> After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
> through mode (gpa_ioasid).
> 
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
> 
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
> 
> 	device_fd1 = open("/dev/vfio/devices/dev1", mode);
> 	device_fd2 = open("/dev/vfio/devices/dev2", mode);
> 	ioasid_fd = open("/dev/ioasid", mode);
> 	ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> 	ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> 
> 	/* pre-register the virtual address range for accounting */
> 	mem_info = { .vaddr = 0x40000000; .size = 1GB };
> 	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> 	/* Attach dev1 and dev2 to gpa_ioasid */
> 	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup GPA mapping */
> 	dma_map = {
> 		.ioasid	= gpa_ioasid;
> 		.iova	= 0; 		// GPA
> 		.vaddr	= 0x40000000;	// HVA
> 		.size	= 1GB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 	/* After boot, the guest enables a GIOVA space for dev2 */

Again, doesn't break the example, but this need not happen after guest
boot.  On the PAPR vIOMMU, the guest IOVA spaces (known as "logical IO
bus numbers" / liobns) and which devices are in each are fixed at
guest creation time and advertised to the guest via firmware.

> 	giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> 
> 	/* First detach dev2 from previous address space */
> 	at_data = { .ioasid = gpa_ioasid};
> 	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> 
> 	/* Then attach dev2 to the new address space */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a shadow DMA mapping according to vIOMMU
> 	  * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000; 	// GIOVA
> 		.vaddr	= 0x40001000;	// HVA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel, instead of the user, that creates the
> shadow mapping.

In this case, I feel like the preregistration is redundant with the
GPA level mapping.  As long as the gIOVA mappings (which might be
frequent) can piggyback on the accounting done for the GPA mapping we
accomplish what we need from preregistration.

> The flow before the guest boots is the same as in 5.2, except for one point. Because
> giova_ioasid is nested on gpa_ioasid, locked page accounting is only
> conducted for gpa_ioasid, so it's not necessary to pre-register the virtual
> memory.
> 
> To save space we only list the steps after the guest boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before the guest boots):
> 
> 	/* After the guest boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Setup a GIOVA->GPA mapping for giova_ioasid, which will be 
> 	  * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> 	  * to form a shadow mapping.
> 	  */
> 	dma_map = {
> 		.ioasid	= giova_ioasid;
> 		.iova	= 0x2000;	// GIOVA
> 		.vaddr	= 0x1000;	// GPA
> 		.size	= 4KB;
> 	};
> 	ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
> 
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to 
> bind the guest IOVA page table with the IOMMU:
> 
> 	/* After the guest boots */
> 	/* Make GIOVA space nested on GPA space */
> 	giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev2 to the new address space (child)
> 	  * Note dev2 is still attached to gpa_ioasid (parent)
> 	  */
> 	at_data = { .ioasid = giova_ioasid};
> 	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= giova_ioasid;
> 		.addr	= giova_pgtable;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> 	/* Invalidate IOTLB when required */
> 	inv_data = {
> 		.ioasid	= giova_ioasid;
> 		// granular information
> 	};
> 	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
> 
> 	/* See 5.6 for I/O page fault handling */
> 	
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
> 
> After the guest boots it further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
> 
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
> 
> The sequence applies to all device types (pdev or mdev), except for
> one additional step calling into KVM for an ENQCMD-capable mdev:
> 
> 	/* After the guest boots */
> 	/* Make GVA space nested on GPA space */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);

I'm not clear what gva_ioasid is representing.  Is it representing a
single vPASID's address space, or a whole bunch of vPASIDs' address
spaces?

> 	/* Attach dev1 to the new address space and specify vPASID */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> 	  * translation structure through KVM
> 	  */
> 	pa_data = {
> 		.ioasid_fd	= ioasid_fd;
> 		.ioasid		= gva_ioasid;
> 		.guest_pasid	= gpasid1;
> 	};
> 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> 	/* Bind guest I/O page table  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> 	...
> 
> 
> 5.6. I/O page fault
> +++++++++++++++
> 
> (The uAPI is TBD. This is just the high-level flow from the host IOMMU driver
> to the guest IOMMU driver and back.)
> 
> -   Host IOMMU driver receives a page request with raw fault_data {rid, 
>     pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
>     information registered by IOASID fault handler;
> 
> -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
>     is saved in ioasid_data->fault_data (used for response);
> 
> -   IOASID fault handler generates a user fault_data (ioasid, addr), links it
>     to the shared ring buffer and triggers the eventfd to userspace;
> 
> -   Upon receiving the event, Qemu needs to find the virtual routing information
>     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
>     multiple, pick a random one. This should be fine since the purpose is to
>     fix the I/O page table in the guest;
> 
> -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
>     carrying the virtual fault data (v_rid, v_pasid, addr);
> 
> -   Guest IOMMU driver fixes up the fault, updates the I/O page table, and
>     then sends a page response with virtual completion data (v_rid, v_pasid, 
>     response_code) to vIOMMU;
> 
> -   Qemu finds the pending fault event, converts virtual completion data 
>     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
>     complete the pending fault;
> 
> -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
>     ioasid_data->fault_data, and then calls iommu api to complete it with
>     {rid, pasid, response_code};
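
A minimal sketch of the Qemu side of this loop, assuming the TBD pieces take
the shape described above (a shared ring of {ioasid, addr} records, an
eventfd, and a hypothetical IOASID_COMPLETE_FAULT ioctl; every name below is
a placeholder):

	pfd = { .fd = fault_eventfd; .events = POLLIN; };
	poll(&pfd, 1, -1);
	read(fault_eventfd, &cnt, sizeof(cnt));		// drain the eventfd

	while (ring_pop(fault_ring, &evt)) {		// evt = { ioasid, addr }
		// look up (v_rid, v_pasid) for this ioasid, then inject via vIOMMU
		inject_viommu_fault(v_rid, v_pasid, evt.addr);
	}

	/* later, when the guest posts its page response */
	resp = { .ioasid = evt.ioasid; .response_code = code; };
	ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &resp);
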
> 
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
> 
> The PASID table is put in the GPA space on some platforms, thus it must be updated
> by the guest. It is treated as another user page table to be bound with the
> IOMMU.
> 
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
> 
> vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> enabled, requires the guest to invalidate PASID cache for any change on the 
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> 
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
> 
> 	/* After the guest boots */
> 	/* Make vPASID space nested on GPA space */
> 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);

I think this time pasidtbl_ioasid is representing multiple vPASID
address spaces, yes?  In which case I don't think it should be treated
as the same sort of object as a normal IOASID, which represents a
single address space IIUC.

> 	/* Attach dev1 to pasidtbl_ioasid */
> 	at_data = { .ioasid = pasidtbl_ioasid};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind PASID table */
> 	bind_data = {
> 		.ioasid	= pasidtbl_ioasid;
> 		.addr	= gpa_pasid_table;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> 
> 	/* vIOMMU detects a new GVA I/O space created */
> 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> 				gpa_ioasid);
> 
> 	/* Attach dev1 to the new address space, with gpasid1 */
> 	at_data = {
> 		.ioasid		= gva_ioasid;
> 		.flag 		= IOASID_ATTACH_USER_PASID;
> 		.user_pasid	= gpasid1;
> 	};
> 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> 	/* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> 	  * used, the kernel will not update the PASID table. Instead, just
> 	  * track the bound I/O page table for handling invalidation and
> 	  * I/O page faults.
> 	  */
> 	bind_data = {
> 		.ioasid	= gva_ioasid;
> 		.addr	= gva_pgtable1;
> 		// and format information
> 	};
> 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Hrm.. if you still have to individually bind a table for each vPASID,
what's the point of BIND_PASID_TABLE?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 17:35 ` Jason Gunthorpe
  2021-06-01  8:10   ` Tian, Kevin
@ 2021-06-02  6:32   ` David Gibson
  2021-06-02 16:16     ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-02  6:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Fri, May 28, 2021 at 02:35:38PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
[snip]
> > With above design /dev/ioasid uAPI is all about I/O address spaces. 
> > It doesn't include any device routing information, which is only 
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). 
> 
> I agree with Jean-Philippe - at the very least erasing this
> information needs a major rational - but I don't really see why it
> must be erased? The HW reports the originating device, is it just a
> matter of labeling the devices attached to the /dev/ioasid FD so it
> can be reported to userspace?

HW reports the originating device as far as it knows.  In many cases
where you have multiple devices in an IOMMU group, it's because
although they're treated as separate devices at the kernel level, they
have the same RID at the HW level.  Which means a RID for something in
the right group is the closest you can count on supplying.

[snip]
> > However this way significantly 
> > violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> > one address space any more. Device routing information (indirectly 
> > marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> > page faulting uAPI to help connect vIOMMU with the underlying 
> > pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> I'm confused by this rational.
> 
> For a vIOMMU that has IO page tables in the guest the basic
> choices are:
>  - Do we have a hypervisor trap to bind the page table or not? (RID
>    and PASID may differ here)
>  - Do we have a hypervisor trap to invaliate the page tables or not?
> 
> If the first is a hypervisor trap then I agree it makes sense to create a
> child IOASID that points to each guest page table and manage it
> directly. This should not require walking guest page tables as it is
> really just informing the HW where the page table lives. HW will walk
> them.
> 
> If there are no hypervisor traps (does this exist?) then there is no
> way to involve the hypervisor here and the child IOASID should simply
> be a pointer to the guest's data structure that describes binding. In
> this case that IOASID should claim all PASIDs when bound to a
> RID. 

And in that case I think we should call that object something other
than an IOASID, since it represents multiple address spaces.

> Invalidation should be passed up the to the IOMMU driver in terms of
> the guest tables information and either the HW or software has to walk
> to guest tables to make sense of it.
> 
> Events from the IOMMU to userspace should be tagged with the attached
> device label and the PASID/substream ID. This means there is no issue
> to have a a 'all PASID' IOASID.
> 
> > Notes:
> > -   It might be confusing as IOASID is also used in the kernel (drivers/
> >     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> >     find a better name later to differentiate.
> 
> +1 on Jean-Philippe's remarks
> 
> > -   PPC has not be considered yet as we haven't got time to fully understand
> >     its semantics. According to previous discussion there is some generality 
> >     between PPC window-based scheme and VFIO type1 semantics. Let's 
> >     first make consensus on this proposal and then further discuss how to 
> >     extend it to cover PPC's requirement.
> 
> From what I understood PPC is not so bad, Nesting IOASID's did its
> preload feature and it needed a way to specify/query the IOVA range a
> IOASID will cover.
> 
> > -   There is a protocol between vfio group and kvm. Needs to think about
> >     how it will be affected following this proposal.
> 
> Ugh, I always stop looking when I reach that boundary. Can anyone
> summarize what is going on there?
> 
> Most likely passing the /dev/ioasid into KVM's FD (or vicevera) is the
> right answer. Eg if ARM needs to get the VMID from KVM and set it to
> ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
> reasonable. Certainly better than the symbol get sutff we have right
> now.
> 
> I will read through the detail below in another email
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 19:58 ` Jason Gunthorpe
  2021-06-01  8:38   ` Tian, Kevin
@ 2021-06-02  6:48   ` David Gibson
  2021-06-02 16:58     ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-02  6:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Fri, May 28, 2021 at 04:58:39PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > 
> > 5. Use Cases and Flows
> > 
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> > 
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > 
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > 
> > 	ioasid_fd = open("/dev/ioasid", mode);
> > 
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Leaving aside whether group fds should exist, while they *do* exist
binding to an IOASID should be done on the group not an individual
device.

[snip]
> > 	/* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> > 	  * translation structure through KVM
> > 	  */
> > 	pa_data = {
> > 		.ioasid_fd	= ioasid_fd;
> > 		.ioasid		= gva_ioasid;
> > 		.guest_pasid	= gpasid1;
> > 	};
> > 	ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Make sense
> 
> > 	/* Bind guest I/O page table  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I'm pretty sure the device(s) could matter, although they probably
won't usually.  But it would certainly be possible for a system to
have two different host bridges with two different IOMMUs with
different pagetable formats.  Until you know which devices (and
therefore which host bridge) you're talking about, you don't know what
formats of pagetable to accept.  And if you have devices from *both*
bridges you can't bind a page table at all - you could theoretically
support a kernel managed pagetable by mirroring each MAP and UNMAP to
tables in both formats, but it would be pretty reasonable not to
support that.

> > 5.6. I/O page fault
> > +++++++++++++++
> > 
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> > 
> > -   Host IOMMU driver receives a page request with raw fault_data {rid, 
> >     pasid, addr};
> > 
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> >     information registered by IOASID fault handler;
> > 
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr), which 
> >     is saved in ioasid_data->fault_data (used for response);
> > 
> > -   IOASID fault handler generates an user fault_data (ioasid, addr), links it 
> >     to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

I like the idea of labelling devices when they're attached; it makes
extension to non-PCI devices much more obvious than having to deal
with concrete RIDs.

But, remember we can only (reliably) determine rid up to the group
boundary.  So if you're labelling devices, all devices in a group
would have to have the same label.  Or you attach the label to a group
not a device, which would be a reason to represent the group as an
object again.

> > -   Upon received event, Qemu needs to find the virtual routing information 
> >     (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are 
> >     multiple, pick a random one. This should be fine since the purpose is to
> >     fix the I/O page table on the guest;
> 
> The device label should fix this
>  
> > -   Qemu finds the pending fault event, converts virtual completion data 
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
> >     complete the pending fault;
> > 
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in 
> >     ioasid_data->fault_data, and then calls iommu api to complete it with
> >     {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> > 
> > PASID table is put in the GPA space on some platform, thus must be updated
> > by the guest. It is treated as another user page table to be bound with the 
> > IOMMU.
> > 
> > As explained earlier, the user still needs to explicitly bind every user I/O 
> > page table to the kernel so the same pgtable binding protocol (bind, cache 
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> > enabled, requires the guest to invalidate PASID cache for any change on the 
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> > 
> > 	/* After boots */
> > 	/* Make vPASID space nested on GPA space */
> > 	pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> > 
> > 	/* Attach dev1 to pasidtbl_ioasid */
> > 	at_data = { .ioasid = pasidtbl_ioasid};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 
> > 	/* Bind PASID table */
> > 	bind_data = {
> > 		.ioasid	= pasidtbl_ioasid;
> > 		.addr	= gpa_pasid_table;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> > 
> > 	/* vIOMMU detects a new GVA I/O space created */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> > 
> > 	/* Attach dev1 to the new address space, with gpasid1 */
> > 	at_data = {
> > 		.ioasid		= gva_ioasid;
> > 		.flag 		= IOASID_ATTACH_USER_PASID;
> > 		.user_pasid	= gpasid1;
> > 	};
> > 	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 
> > 	/* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > 	  * used, the kernel will not update the PASID table. Instead, just
> > 	  * track the bound I/O page table for handling invalidation and
> > 	  * I/O page faults.
> > 	  */
> > 	bind_data = {
> > 		.ioasid	= gva_ioasid;
> > 		.addr	= gva_pgtable1;
> > 		// and format information
> > 	};
> > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benifit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:56     ` Jason Gunthorpe
  2021-06-02  2:00       ` Tian, Kevin
@ 2021-06-02  6:57       ` David Gibson
  2021-06-02 16:37         ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-02  6:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Tue, Jun 01, 2021 at 02:56:43PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > > 
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > > >
> > > > 	ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-virtualization
> > > > scenario.
> > > 
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> > 
> > Jason, want to confirm here. Per earlier discussion we remain an
> > impression that you want VFIO to be a pure device driver thus
> > container/group are used only for legacy application.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 
> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

I don't think presence or absence of a group fd makes a lot of
difference to this design.  Having a group fd just means we attach
groups to the ioasid instead of individual devices, and we no longer
need the bookkeeping of "partial" devices.

> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.

Right.  I'd assume that for compatibility, creating a container would
create a single IOASID under the hood with a compatibility layer
translating the container operations to ioasid operations.
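
A sketch of what that shim might translate, purely as an illustration (the
ioasid_compat_*() helpers are made up; the legacy ioctl names are the
existing VFIO type1 ones):

	long vfio_type1_compat_ioctl(struct vfio_container *c, unsigned int cmd,
				     unsigned long arg)
	{
		switch (cmd) {
		case VFIO_IOMMU_MAP_DMA:	/* -> IOASID_DMA_MAP */
			return ioasid_compat_map(c->compat_ioasid, arg);
		case VFIO_IOMMU_UNMAP_DMA:	/* -> IOASID_DMA_UNMAP */
			return ioasid_compat_unmap(c->compat_ioasid, arg);
		case VFIO_IOMMU_GET_INFO:	/* -> IOASID_GET_INFO */
			return ioasid_compat_get_info(c->compat_ioasid, arg);
		default:
			return -ENOTTY;
		}
	}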

> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

Again, I don't think it makes much difference.  The model doesn't
really change even if you allow both ATTACH_GROUP and ATTACH_DEVICE on
the IOASID.  Basically ATTACH_GROUP would just be equivalent to
attaching all the constituent devices.

> I didn't try to chart this out carefully.
> 
> Also, ultimately, something need to be done about compatability with
> the vfio container fd. It looks clear enough to me that the the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite rasonable to harmonize these somehow.
> 
> But that is too complicated and far out for me at least to guess on at
> this point..
> 
> > > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > > there any scenario where we want different vpasid's for the same
> > > IOASID? I guess it is OK like this. Hum.
> > 
> > Yes, it's completely sane that the guest links a I/O page table to 
> > different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> > that when multiple devices share an I/O page table they must use
> > the same PASID#. 
> 
> Ok..
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36 ` Jason Gunthorpe
  2021-05-31 11:31   ` Liu Yi L
  2021-06-01 11:09   ` Lu Baolu
@ 2021-06-02  7:22   ` David Gibson
  2021-06-03  6:39   ` Tian, Kevin
  3 siblings, 0 replies; 258+ messages in thread
From: David Gibson @ 2021-06-02  7:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Fri, May 28, 2021 at 08:36:49PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> > 
> > /*
> >   * Check whether an uAPI extension is supported. 
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration. 
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
>  
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   *	- vaddr;
> >   *	- size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 2)
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.

Right, I think we can simplify the interface by modelling the
preregistration as a nesting layer.  Well, mostly.. the wrinkle is
that generally you can't do anything with an ioasid until you've
attached devices to it, but that doesn't really make sense for the
prereg layer.  I expect we can find some way to deal with that,
though.

Actually... to simplify that "weak nesting" concept I wonder if we
want to expand to 3 ways of specifying the pagetables for the ioasid:
  1) kernel managed (MAP/UNMAP)
  2) user managed (BIND/INVALIDATE)
  3) pass-though (IOVA==parent address)

Obviously pass-through wouldn't be allowed in all circumstances.
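
Expressed as a sketch (the enum is just an illustration of the three modes,
not a proposed interface):

	/* illustrative only */
	enum ioasid_pgtable_mode {
		IOASID_PGTABLE_KERNEL,		/* kernel managed: MAP/UNMAP */
		IOASID_PGTABLE_USER,		/* user managed: BIND/INVALIDATE */
		IOASID_PGTABLE_PASSTHROUGH,	/* IOVA == parent address */
	};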

> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID. 
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address 
> >   * space. Each IOASID is associated with a single I/O page table. User 
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is 
> >   * attached to a device. Once attached, an empty I/O page table is 
> >   * bound with the IOMMU then the user could use either DMA mapping 
> >   * or pgtable binding commands to manage this I/O page table.
> 
> Can the IOASID can be populated before being attached?

I don't think it reasonably can.  Until attached, you don't actually
know what hardware IOMMU will be backing it, and therefore you don't
know its capabilities.  You can't really allow mappings if you don't
even know the allowed IOVA ranges and page sizes.

> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
> 
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> >   *	- vendor pgtable formats (pgtable binding);
> >   *	- number of child IOASIDs (nesting);
> >   *	- ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.
> 
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

Yes... but as above, we have no idea what the IOMMU's capabilities are
until devices are attached.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.
> 
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is tring to use.

[snip]
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
> 
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *  - ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

The group number could be used for that, even if there are no group
fds.  You generally can't identify things more narrowly than group
anyway.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-31 17:37 ` Parav Pandit
  2021-05-31 18:12   ` Jason Gunthorpe
@ 2021-06-02  8:38   ` Enrico Weigelt, metux IT consult
  2021-06-02 12:41     ` Parav Pandit
  1 sibling, 1 reply; 258+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-02  8:38 UTC (permalink / raw)
  To: Parav Pandit, Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe,
	Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On 31.05.21 19:37, Parav Pandit wrote:

> It appears that this is only to make map ioctl faster apart from accounting.
> It doesn't have any ioasid handle input either.
> 
> In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
> For example few years back such system call mpin() thought was proposed in [1].

I'm very reluctant to add to the syscall inflation. We already have lots of
syscalls that could easily have been done via devices or filesystems
(yes, some of them are just old Unix relics).

Syscalls don't play well w/ modules, containers, distributed systems,
etc, and need extra low-level code for most non-C languages (eg.
scripting languages).


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 20:28     ` Jason Gunthorpe
  2021-06-02  1:25       ` Tian, Kevin
@ 2021-06-02  8:52       ` Jason Wang
  2021-06-02 16:07         ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-02  8:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy


On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
>> I summarized five opens here, about:
>>
>> 1)  Finalizing the name to replace /dev/ioasid;
>> 2)  Whether one device is allowed to bind to multiple IOASID fd's;
>> 3)  Carry device information in invalidation/fault reporting uAPI;
>> 4)  What should/could be specified when allocating an IOASID;
>> 5)  The protocol between vfio group and kvm;
>>
>> For 1), two alternative names are mentioned: /dev/iommu and
>> /dev/ioas. I don't have a strong preference and would like to hear
>> votes from all stakeholders. /dev/iommu is slightly better imho for
>> two reasons. First, per AMD's presentation in last KVM forum they
>> implement vIOMMU in hardware thus need to support user-managed
>> domains. An iommu uAPI notation might make more sense moving
>> forward. Second, it makes later uAPI naming easier as 'IOASID' can
>> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
>> IOASID_ALLOC_IOASID.:)
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time.


It looks to me like using "/dev/iommu" excludes the possibility of
implementing IOASID in a device-specific way (e.g. through
co-operation between the device MMU and the platform IOMMU)?

What's more, the ATS spec doesn't forbid the device #PF from being reported in a
device-specific way.

Thanks


> We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
>
> Then we don't have naming aliasing with kernel constructs.
>   



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:31                       ` Jason Gunthorpe
@ 2021-06-02  8:54                         ` Jason Wang
  2021-06-02 17:21                           ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-02  8:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L, kvm, Jonathan Corbet, iommu,
	LKML, Alex Williamson (alex.williamson@redhat.com)"",
	David Woodhouse


On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
>   
>> We can open up to ~0U file descriptors, I don't see why we need to restrict
>> it in uAPI.
> There are significant problems with such large file descriptor
> tables. High FD numbers man things like select don't work at all
> anymore and IIRC there are more complications.


I don't see much difference between IOASID fds and other types of fds. People
can choose to use poll or epoll.

And with the current proposal (assuming there's an N:1 ioasid-to-fd
relationship), I wonder how select can work for a specific ioasid.

Thanks


>
> A huge number of FDs for typical usages should be avoided.
>
> Jason
>



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-05-27  7:58 [RFC] /dev/ioasid uAPI proposal Tian, Kevin
                   ` (9 preceding siblings ...)
  2021-06-02  6:15 ` David Gibson
@ 2021-06-02  8:56 ` Enrico Weigelt, metux IT consult
  2021-06-02 17:24   ` Jason Gunthorpe
  10 siblings, 1 reply; 258+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-02  8:56 UTC (permalink / raw)
  To: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On 27.05.21 09:58, Tian, Kevin wrote:

Hi,

> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.

While I'm in favour of having generic APIs for generic tasks, as well as
using FDs, I wonder whether it has to be a new and separate device.

Now applications have to use multiple APIs in lockstep. One consequence
of that is operators, as well as provisioning systems, container
infrastructures, etc, always have to consider multiple devices together.

You can't just say "give workload XY access to device /dev/foo" anymore.
Now you have to take care of scenarios like "if someone wants /dev/foo,
he also needs /dev/bar". And if that happens multiple times ("/dev/foo
and /dev/wurst both require /dev/bar"), leading to scenarios where the
dev nodes are bind-mounted somewhere, you need to take care that
additional devices aren't bind-mounted twice, etc ...

If I understand this correctly, /dev/ioasid is a kind of "common 
supplier" to other APIs / devices. Why can't the fd be acquired by the
consumer APIs (eg. kvm, vfio, etc) ?


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:29                   ` Jason Gunthorpe
@ 2021-06-02  8:58                     ` Jason Wang
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Wang @ 2021-06-02  8:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L,
	Alex Williamson (alex.williamson@redhat.com),
	kvm, Jonathan Corbet, LKML, iommu, David Woodhouse


On 2021/6/2 1:29 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:
>
>> For the case of 1M, I would like to know what's the use case for a single
>> process to handle 1M+ address spaces?
> For some scenarios every guest PASID will require a IOASID ID # so
> there is a large enough demand that FDs alone are not a good fit.
>
> Further there are global container wide properties that are hard to
> carry over to a multi-FD model, like the attachment of devices to the
> container at the startup.


So if we implement a per-fd model, the global "container" properties
could be done via the parent fd, e.g. attaching the parent to the device
at startup.


>
>>> So this RFC treats fd as a container of address spaces which is each
>>> tagged by an IOASID.
>> If the container and address space is 1:1 then the container seems useless.
> The examples at the bottom of the document show multiple IOASIDs in
> the container for a parent/child type relationship


This can also be done per fd? A parent fd can have multiple child fds.

Thanks


>
> Jason
>



* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  8:38   ` Enrico Weigelt, metux IT consult
@ 2021-06-02 12:41     ` Parav Pandit
  0 siblings, 0 replies; 258+ messages in thread
From: Parav Pandit @ 2021-06-02 12:41 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult, Tian, Kevin, LKML,
	Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse, iommu,
	kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang
  Cc: Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy


> From: Enrico Weigelt, metux IT consult <lkml@metux.net>
> Sent: Wednesday, June 2, 2021 2:09 PM
> 
> On 31.05.21 19:37, Parav Pandit wrote:
> 
> > It appears that this is only to make map ioctl faster apart from accounting.
> > It doesn't have any ioasid handle input either.
> >
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example few years back such system call mpin() thought was proposed
> in [1].
> 
> I'm very reluctant to more syscall inflation. We already have lots of syscalls
> that could have been easily done via devices or filesystems (yes, some of
> them are just old Unix relics).
> 
> Syscalls don't play well w/ modules, containers, distributed systems, etc, and
> need extra low-level code for most non-C languages (eg.
> scripting languages).

Likely, but as per my understanding this ioctl() is a wrapper around device-agnostic code such as:

 {
   atomic64_inc(&mm->pinned_vm);
   pin_user_pages(...);
 }

And mm must hold a reference to them, so that these pages cannot be munmap()'ed or freed.

And the second reason, I think (I could be wrong), is that the second level page table for a PASID should be the same as what the process CR3 uses.
Essentially the iommu page table and the mmu page table should be pointing to the same page table entries.
If they are different, then even if the guest CPU has accessed the pages, device access via the IOMMU will result in expensive page faults.

So assuming both cr3 and the pasid table entry point to the same page table, I fail to understand the need for an extra refcount and hence a driver-specific ioctl().
Though I do not have a strong objection to the ioctl(), I want to know what it will and will not do.
io_uring has a similar ioctl() doing io_sqe_buffer_register(), pinning the memory.
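
A minimal sketch of that pin-plus-account pattern (modelled on what
io_sqe_buffer_register() style code does; the function name and the
npages parameter are just for illustration):

	static int ioasid_pin_and_account(unsigned long uaddr, int npages,
					  struct page **pages)
	{
		struct mm_struct *mm = current->mm;
		int pinned;

		/* charge against the task's pinned-memory accounting */
		if (atomic64_add_return(npages, &mm->pinned_vm) >
		    rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT) {
			atomic64_sub(npages, &mm->pinned_vm);
			return -ENOMEM;
		}

		/* long-term pin so the pages cannot be migrated or freed */
		pinned = pin_user_pages_fast(uaddr, npages,
					     FOLL_WRITE | FOLL_LONGTERM, pages);
		if (pinned != npages) {
			if (pinned > 0)
				unpin_user_pages(pages, pinned);
			atomic64_sub(npages, &mm->pinned_vm);
			return pinned < 0 ? pinned : -EFAULT;
		}
		return 0;
	}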


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  2:20       ` Tian, Kevin
@ 2021-06-02 16:01         ` Jason Gunthorpe
  2021-06-02 17:11           ` Alex Williamson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok,
	Liu, Yi L, Wu, Hao, Jiang, Dave, Jacob Pan,
	Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy

On Wed, Jun 02, 2021 at 02:20:15AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, June 2, 2021 6:22 AM
> > 
> > On Tue, 1 Jun 2021 07:01:57 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > I summarized five opens here, about:
> > >
> > > 1)  Finalizing the name to replace /dev/ioasid;
> > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > 4)  What should/could be specified when allocating an IOASID;
> > > 5)  The protocol between vfio group and kvm;
> > >
> > ...
> > >
> > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > original purpose of this protocol is not about I/O address space. It's
> > > for KVM to know whether any device is assigned to this VM and then
> > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> > 
> > Right, the original use case was for KVM to determine whether it needs
> > to emulate invlpg, so it needs to be aware when an assigned device is
> 
> invlpg -> wbinvd :)
> 
> > present and be able to test if DMA for that device is cache
> > coherent.

Why is this such a strong linkage to VFIO and not just a 'hey kvm
emulate wbinvd' flag from qemu?

I briefly didn't see any obvious linkage in the arch code, just some
dead code:

$ git grep iommu_noncoherent
arch/x86/include/asm/kvm_host.h:	bool iommu_noncoherent;
$ git grep iommu_domain arch/x86
arch/x86/include/asm/kvm_host.h:        struct iommu_domain *iommu_domain;

Huh?

It kind of looks like the other main point is to generate the
VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
connect back to the kvm data

But that seems like it would have been better handled with some IOCTL
on the vfio_device fd to import the KVM to the driver not this
roundabout way?
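
A rough sketch of what such a vfio_device-fd ioctl could look like (all
of the names and numbers below are hypothetical, this is not an existing
uAPI):

	/* hand the KVM fd straight to the vfio device instead of routing
	 * it through the vfio-group/kvm pseudo-device */
	struct vfio_device_set_kvm {
		__u32	argsz;
		__u32	flags;
		__s32	kvm_fd;		/* -1 clears the association */
	};
	#define VFIO_DEVICE_SET_KVM	_IO(VFIO_TYPE, VFIO_BASE + 20)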

> > The user, QEMU, creates a KVM "pseudo" device representing the vfio
> > group, providing the file descriptor of that group to show ownership.
> > The ugly symbol_get code is to avoid hard module dependencies, ie. the
> > kvm module should not pull in or require the vfio module, but vfio will
> > be present if attempting to register this device.
> 
> so the symbol_get thing is not about the protocol itself. Whatever protocol
> is defined, as long as kvm needs to call vfio or ioasid helper function, we 
> need define a proper way to do it. Jason, what's your opinion of alternative 
> option since you dislike symbol_get?

The symbol_get was to avoid module dependencies because bringing in
vfio along with kvm is not nice.

The symbol get is not nice here, but unless things can be truly moved
to lower levels of code where module dependencies are not a problem (eg
kvm to iommu is a non-issue) I don't see much of a solution.
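
For reference, the pattern in question is roughly the following (based
on how virt/kvm/vfio.c resolves vfio symbols; the exact helper shown is
illustrative):

	/* resolve the vfio helper at runtime so kvm has no hard module
	 * dependency on vfio */
	struct vfio_group *(*fn)(struct file *filep);
	struct vfio_group *group;

	fn = symbol_get(vfio_group_get_external_user);
	if (!fn)
		return ERR_PTR(-EINVAL);
	group = fn(filep);
	symbol_put(vfio_group_get_external_user);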

Other cases like kvmgt or AP would be similarly fine to have had a
kvmgt to kvm module dependency.

> > All of these use cases are related to the IOMMU, whether DMA is
> > coherent, translating device IOVA to GPA, and an acceleration path to
> > emulate IOMMU programming in kernel... they seem pretty relevant.
> 
> One open is whether kvm should hold a device reference or IOASID
> reference. For DMA coherence, it only matters whether assigned 
> devices are coherent or not (not for a specific address space). For kvmgt, 
> it is for recoding kvm pointer in mdev driver to do write protection. For 
> ppc, it does relate to a specific I/O page table.
> 
> Then I feel only a part of the protocol will be moved to /dev/ioasid and
> something will still remain between kvm and vfio?

Honestly I would try not to touch this too much :\

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  8:52       ` Jason Wang
@ 2021-06-02 16:07         ` Jason Gunthorpe
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jacob Pan, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 04:52:02PM +0800, Jason Wang wrote:
> 
> On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
> > > I summarized five opens here, about:
> > > 
> > > 1)  Finalizing the name to replace /dev/ioasid;
> > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > 4)  What should/could be specified when allocating an IOASID;
> > > 5)  The protocol between vfio group and kvm;
> > > 
> > > For 1), two alternative names are mentioned: /dev/iommu and
> > > /dev/ioas. I don't have a strong preference and would like to hear
> > > votes from all stakeholders. /dev/iommu is slightly better imho for
> > > two reasons. First, per AMD's presentation in last KVM forum they
> > > implement vIOMMU in hardware thus need to support user-managed
> > > domains. An iommu uAPI notation might make more sense moving
> > > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > > IOASID_ALLOC_IOASID.:)
> > I think two years ago I suggested /dev/iommu and it didn't go very far
> > at the time.
> 
> 
> It looks to me using "/dev/iommu" excludes the possibility of implementing
> IOASID in a device specific way (e.g through the co-operation with device
> MMU + platform IOMMU)?

This is intended to be the 'drivers/iommu' subsystem though. I don't
want to see pluggability here beyond what drivers/iommu is providing.

If someone wants to do a something complicated through this interface
then they need to make a drivers/iommu implementation.

Or they need to use the mdev-esque "SW TABLE" mode when their driver
attaches to the interface.

Whether this is good enough for a specific device is an entirely
different question though.

> What's more, ATS spec doesn't forbid the device #PF to be reported via a
> device specific way.

And if this is done then a kernel function indicating a page fault
should be triggered on the ioasid handle that the device has. It is
still drivers/iommu functionality.

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  1:33       ` Tian, Kevin
@ 2021-06-02 16:09         ` Jason Gunthorpe
  2021-06-03  1:29           ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 01:33:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, June 2, 2021 1:42 AM
> > 
> > On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > >
> > > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > > software nesting. With hardware support the child and parent I/O page
> > > > > tables are walked consecutively by the IOMMU to form a nested
> > translation.
> > > > > When it's implemented in software, the ioasid driver is responsible for
> > > > > merging the two-level mappings into a single-level shadow I/O page
> > table.
> > > > > Software nesting requires both child/parent page tables operated
> > through
> > > > > the dma mapping protocol, so any change in either level can be
> > captured
> > > > > by the kernel to update the corresponding shadow mapping.
> > > >
> > > > Why? A SW emulation could do this synchronization during invalidation
> > > > processing if invalidation contained an IOVA range.
> > >
> > > In this proposal we differentiate between host-managed and user-
> > > managed I/O page tables. If host-managed, the user is expected to use
> > > map/unmap cmd explicitly upon any change required on the page table.
> > > If user-managed, the user first binds its page table to the IOMMU and
> > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > not required when changing a PTE from non-present to present).
> > >
> > > We expect user to use map+unmap and bind+invalidate respectively
> > > instead of mixing them together. Following this policy, map+unmap
> > > must be used in both levels for software nesting, so changes in either
> > > level are captured timely to synchronize the shadow mapping.
> > 
> > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > it is created. If you put two different types in a tree then each IOASID
> > must continue to use its own operation mode.
> > 
> > I don't see a reason to force all IOASIDs in a tree to be consistent??
> 
> only for software nesting. With hardware support the parent uses map
> while the child uses bind.
> 
> Yes, the policy is specified per IOASID. But if the policy violates the
> requirement in a specific nesting mode, then nesting should fail.

I don't get it.

If the IOASID is a page table then it is bind/invalidate. SW or not SW
doesn't matter at all.

> > 
> > A software emulated two level page table where the leaf level is a
> > bound page table in guest memory should continue to use
> > bind/invalidate to maintain the guest page table IOASID even though it
> > is a SW construct.
> 
> with software nesting the leaf should be a host-managed page table
> (or metadata). A bind/invalidate protocol doesn't require the user
> to notify the kernel of every page table change. 

The purpose of invalidate is to inform the implementation that the
page table has changed so it can flush the caches. If the page table
is changed and invalidation is not issued then the implementation
is free to ignore the changes.

In this way the SW mode is the same as a HW mode with an infinite
cache.

The collapsed shadow page table is really just a cache.
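
In other words, for SW nesting something like this is sufficient (a
sketch only; the nest structure and the shadow helpers are invented
names for illustration):

	/* SW nesting treats the collapsed shadow table as a cache that is
	 * only refreshed when the user issues an invalidation */
	static int ioasid_sw_nest_invalidate(struct ioasid_sw_nest *nest,
					     u64 iova, u64 size)
	{
		/* drop the stale shadow entries for the range ... */
		shadow_unmap_range(nest->shadow, iova, size);

		/* ... and rebuild them by re-walking the child's page table
		 * through the parent IOASID's GPA->HPA mapping */
		return shadow_remap_range(nest->shadow, nest->child_pgtable,
					  nest->parent, iova, size);
	}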

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  6:32   ` David Gibson
@ 2021-06-02 16:16     ` Jason Gunthorpe
  2021-06-03  2:11       ` Tian, Kevin
  2021-06-03  5:13       ` David Gibson
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:16 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > I agree with Jean-Philippe - at the very least erasing this
> > information needs a major rational - but I don't really see why it
> > must be erased? The HW reports the originating device, is it just a
> > matter of labeling the devices attached to the /dev/ioasid FD so it
> > can be reported to userspace?
> 
> HW reports the originating device as far as it knows.  In many cases
> where you have multiple devices in an IOMMU group, it's because
> although they're treated as separate devices at the kernel level, they
> have the same RID at the HW level.  Which means a RID for something in
> the right group is the closest you can count on supplying.

Granted there may be cases where exact fidelity is not possible, but
that doesn't excuse eliminating fidelity where it does exist.

> > If there are no hypervisor traps (does this exist?) then there is no
> > way to involve the hypervisor here and the child IOASID should simply
> > be a pointer to the guest's data structure that describes binding. In
> > this case that IOASID should claim all PASIDs when bound to a
> > RID. 
> 
> And in that case I think we should call that object something other
> than an IOASID, since it represents multiple address spaces.

Maybe.. It is certainly a special case.

We can still consider it a single "address space" from the IOMMU
perspective. What has happened is that the address table is not just a
64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

If we are already going in the direction of having the IOASID specify
the page table format and other details, specifying that the page
table format is the 80 bit "PASID, IOVA" format is a fairly small
step.

I wouldn't twist things into knots to create a difference, but if it
is easy to do it wouldn't hurt either.
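
Expressed as data, the "address" for such an IOASID would simply be the
pair (the layout below is illustrative only, not part of the proposal):

	/* an ~80 bit extended address for an IOASID whose format is
	 * "PASID table": the PASID selects the space, the IOVA indexes it */
	struct ioas_ext_addr {
		__u32	pasid;	/* up to 20 bits of PASID */
		__u64	iova;	/* 64 bit IOVA within that PASID's space */
	};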

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  6:57       ` David Gibson
@ 2021-06-02 16:37         ` Jason Gunthorpe
  2021-06-03  5:23           ` David Gibson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:37 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:

> I don't think presence or absence of a group fd makes a lot of
> difference to this design.  Having a group fd just means we attach
> groups to the ioasid instead of individual devices, and we no longer
> need the bookkeeping of "partial" devices.

Oh, I think we really don't want to attach the group to an ioasid, or
at least not as a first-class idea.

The fundamental problem that got us here is we now live in a world
where there are many ways to attach a device to an IOASID:

 - A RID binding
 - A RID,PASID binding
 - A RID,PASID binding for ENQCMD
 - A SW TABLE binding
 - etc

The selection of which mode to use is based on the specific
driver/device operation. Ie the thing that implements the 'struct
vfio_device' is the thing that has to select the binding mode.

group attachment was fine when there was only one mode. As you say it
is fine to just attach every group member with RID binding if RID
binding is the only option.

When SW TABLE binding was added the group code was hacked up - now the
group logic is choosing between RID/SW TABLE in a very hacky and mdev
specific way, and this is just a mess.

The flow must carry the IOASID from the /dev/iommu to the vfio_device
driver and the vfio_device implementation must choose which binding
mode and parameters it wants based on driver and HW configuration.

eg if two PCI devices are in a group then it is perfectly fine that
one device uses RID binding and the other device uses RID,PASID
binding.

The only place I see for a "group bind" in the uAPI is some compat
layer for the vfio container, and the implementation would be quite
different, we'd have to call each vfio_device driver in the group and
execute the IOASID attach IOCTL.
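
i.e. roughly this, for the compat path only (a sketch; the per-device
attach hook is an assumption, not existing code):

	/* vfio-container compat: "attach the group" becomes attaching every
	 * vfio_device in the group to the same IOASID, one at a time */
	static int vfio_group_attach_ioasid_compat(struct vfio_group *group,
						   int ioasid_fd, u32 ioasid)
	{
		struct vfio_device *vdev;
		int ret;

		list_for_each_entry(vdev, &group->device_list, group_next) {
			ret = vdev->ops->attach_ioasid(vdev, ioasid_fd, ioasid);
			if (ret)
				return ret;	/* caller unwinds prior attaches */
		}
		return 0;
	}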

> > I would say no on the container. /dev/ioasid == the container, having
> > two competing objects at once in a single process is just a mess.
> 
> Right.  I'd assume that for compatibility, creating a container would
> create a single IOASID under the hood with a compatiblity layer
> translating the container operations to iosaid operations.

It is a nice dream for sure

/dev/vfio could be a special case of /dev/ioasid just with a different
uapi and ending up with only one IOASID. They could be interchangeable
from then on, which would simplify the internals of VFIO if it
consistently dealt with these new ioasid objects everywhere. But last I
looked it was complicated enough to best be done later on

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  6:48   ` David Gibson
@ 2021-06-02 16:58     ` Jason Gunthorpe
  2021-06-03  2:49       ` Tian, Kevin
  2021-06-03  5:45       ` David Gibson
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 16:58 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > 	/* Bind guest I/O page table  */
> > > 	bind_data = {
> > > 		.ioasid	= gva_ioasid;
> > > 		.addr	= gva_pgtable1;
> > > 		// and format information
> > > 	};
> > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > 
> > Again I do wonder if this should just be part of alloc_ioasid. Is
> > there any reason to split these things? The only advantage to the
> > split is the device is known, but the device shouldn't impact
> > anything..
> 
> I'm pretty sure the device(s) could matter, although they probably
> won't usually. 

It is a bit subtle, but the /dev/iommu fd itself is connected to the
devices first. This prevents wildly incompatible devices from being
joined together, and allows some "get info" to report the capability
union of all devices if we want to do that.

The original concept was that devices joined would all have to support
the same IOASID format, at least for the kernel owned map/unmap IOASID
type. Supporting different page table formats maybe is reason to
revisit that concept.

There is a small advantage to re-using the IOASID container because of
the get_user_pages caching and pinned accounting management at the FD
level.

I don't know if that small advantage is worth the extra complexity
though.

> But it would certainly be possible for a system to have two
> different host bridges with two different IOMMUs with different
> pagetable formats.  Until you know which devices (and therefore
> which host bridge) you're talking about, you don't know what formats
> of pagetable to accept.  And if you have devices from *both* bridges
> you can't bind a page table at all - you could theoretically support
> a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> in both formats, but it would be pretty reasonable not to support
> that.

The basic process for a user space owned pgtable mode would be:

 1) qemu has to figure out what format of pgtable to use

    Presumably it uses query functions using the device label. The
    kernel code should look at the entire device path through all the
    IOMMU HW to determine what is possible.

    Or it already knows because the VM's vIOMMU is running in some
    fixed page table format, or the VM's vIOMMU already told it, or
    something.

 2) qemu creates an IOASID and based on #1 and says 'I want this format'

 3) qemu binds the IOASID to the device. 

    If qemu gets it wrong then it just fails.

 4) For the next device qemu would have to figure out if it can re-use
    an existing IOASID based on the required properties.
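
In rough pseudo-code the flow above would look something like this on
the qemu side (the ioctl names follow the RFC's sample flows where they
exist; the query step and the struct fields are assumptions):

	/* 1) ask what the IOMMU path behind this device supports (assumed) */
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	/* 2) allocate an IOASID requesting the vIOMMU's page table format */
	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_req);

	/* 3) attach the device; the kernel fails this if the requested
	 *    format cannot be supported on the device's IOMMU path */
	ioctl(device_fd, VFIO_ATTACH_IOASID, &gva_ioasid);

	/* 4) for the next device, try attaching to the same IOASID first
	 *    and only allocate a new one if the attach is rejected */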

You pointed to the case of mixing vIOMMU's of different platforms. So
it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
page table mode v2" while running on an x86 because that is what the
vIOMMU is wired to work with.

Presumably qemu will fall back to software emulation if this is not
possible.

One interesting option for software emulation is to just transform the
ARM page table format to a x86 page table format in userspace and use
nested bind/invalidate to synchronize with the kernel. With SW nesting
I suspect this would be much faster

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:01         ` Jason Gunthorpe
@ 2021-06-02 17:11           ` Alex Williamson
  2021-06-02 17:35             ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-02 17:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, 2 Jun 2021 13:01:40 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 02, 2021 at 02:20:15AM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, June 2, 2021 6:22 AM
> > > 
> > > On Tue, 1 Jun 2021 07:01:57 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:  
> > > >
> > > > I summarized five opens here, about:
> > > >
> > > > 1)  Finalizing the name to replace /dev/ioasid;
> > > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > > 4)  What should/could be specified when allocating an IOASID;
> > > > 5)  The protocol between vfio group and kvm;
> > > >  
> > > ...  
> > > >
> > > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > > original purpose of this protocol is not about I/O address space. It's
> > > > for KVM to know whether any device is assigned to this VM and then
> > > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).  
> > > 
> > > Right, the original use case was for KVM to determine whether it needs
> > > to emulate invlpg, so it needs to be aware when an assigned device is  
> > 
> > invlpg -> wbinvd :)

Oops, of course.
   
> > > present and be able to test if DMA for that device is cache
> > > coherent.  
> 
> Why is this such a strong linkage to VFIO and not just a 'hey kvm
> emulate wbinvd' flag from qemu?

IIRC, wbinvd has host implications, a malicious user could tell KVM to
emulate wbinvd then run the op in a loop and induce a disproportionate
load on the system.  We therefore wanted a way that it would only be
enabled when required.

> I briefly didn't see any obvios linkage in the arch code, just some
> dead code:
> 
> $ git grep iommu_noncoherent
> arch/x86/include/asm/kvm_host.h:	bool iommu_noncoherent;
> $ git grep iommu_domain arch/x86
> arch/x86/include/asm/kvm_host.h:        struct iommu_domain *iommu_domain;
> 
> Huh?

Cruft from legacy KVM device assignment, I assume.  What you're looking
for is:

kvm_vfio_update_coherency
 kvm_arch_register_noncoherent_dma
  atomic_inc(&kvm->arch.noncoherent_dma_count);

need_emulate_wbinvd
 kvm_arch_has_noncoherent_dma
  atomic_read(&kvm->arch.noncoherent_dma_count);

There are a couple other callers that I'm not as familiar with.

> It kind of looks like the other main point is to generate the
> VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> connect back to the kvm data
> 
> But that seems like it would have been better handled with some IOCTL
> on the vfio_device fd to import the KVM to the driver not this
> roundabout way?

Then QEMU would need to know which drivers require KVM knowledge?  This
allowed transparent backwards compatibility with userspace.  Thanks,

Alex



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  6:15 ` David Gibson
@ 2021-06-02 17:19   ` Jason Gunthorpe
  2021-06-03  3:02     ` Tian, Kevin
  2021-06-03  6:26     ` David Gibson
  2021-06-03  7:17   ` Tian, Kevin
  2021-06-03  8:12   ` Tian, Kevin
  2 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 17:19 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:

> Is there a compelling reason to have all the IOASIDs handled by one
> FD?

There was an answer on this: if every PASID needs an IOASID then there
are too many FDs.

It is difficult to share the get_user_pages cache across FDs.

There are global properties in the /dev/iommu FD, like what devices
are part of it, that are important for group security operations. This
becomes confused if it is split to many FDs.

> > I/O address space can be managed through two protocols, according to 
> > whether the corresponding I/O page table is constructed by the kernel or 
> > the user. When kernel-managed, a dma mapping protocol (similar to 
> > existing VFIO iommu type1) is provided for the user to explicitly specify 
> > how the I/O address space is mapped. Otherwise, a different protocol is 
> > provided for the user to bind an user-managed I/O page table to the 
> > IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> > handling. 
> > 
> > Pgtable binding protocol can be used only on the child IOASID's, implying 
> > IOASID nesting must be enabled. This is because the kernel doesn't trust 
> > userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> > through the parent IOASID.
> 
> To clarify, I'm guessing that's a restriction of likely practice,
> rather than a fundamental API restriction.  I can see a couple of
> theoretical future cases where a user-managed pagetable for a "base"
> IOASID would be feasible:
> 
>   1) On some fancy future MMU allowing free nesting, where the kernel
>      would insert an implicit extra layer translating user addresses
>      to physical addresses, and the userspace manages a pagetable with
>      its own VAs being the target AS

I would model this by having an "SVA" parent IOASID. An "SVA" IOASID is
one where the IOVA == process VA and the kernel maintains this mapping.

Since the uAPI is so general I do have a general expectation that the
drivers/iommu implementations might need to be a bit more complicated,
like if the HW can optimize certain specific graphs of IOASIDs we
would still model them as graphs and the HW driver would have to
"compile" the graph into the optimal hardware.

This approach has worked reasonably well in other kernel areas.

>   2) For a purely software virtual device, where its virtual DMA
>      engine can interpet user addresses fine

This also sounds like an SVA IOASID.

Depending on HW if a device can really only bind to a very narrow kind
of IOASID then it should ask for that (probably platform specific!)
type during its attachment request to drivers/iommu.

eg "I am special hardware and only know how to do PLATFORM_BLAH
transactions, give me an IOASID comatible with that". If the only way
to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
hardwired to the CPU ASID  then that is just how it is.
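
A sketch of what that attach-time request could look like (every name
here is made up for illustration, none of this is a proposed API):

	/* the device driver states the only IOASID flavour it can use when
	 * it attaches through drivers/iommu */
	enum ioasid_compat {
		IOASID_COMPAT_GENERIC,		/* ordinary IOVA page table */
		IOASID_COMPAT_SVA,		/* IOVA == process VA */
		IOASID_COMPAT_PLATFORM_BLAH,	/* platform specific format */
	};

	ret = ioasid_device_attach(dev, ioasid, IOASID_COMPAT_PLATFORM_BLAH);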

> I wonder if there's a way to model this using a nested AS rather than
> requiring special operations.  e.g.
> 
> 	'prereg' IOAS
> 	|
> 	\- 'rid' IOAS
> 	   |
> 	   \- 'pasid' IOAS (maybe)
> 
> 'prereg' would have a kernel managed pagetable into which (for
> example) qemu platform code would map all guest memory (using
> IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
> IO mappings into the 'rid' IOAS in terms of GPA.
> 
> This wouldn't quite work as is, because the 'prereg' IOAS would have
> no devices.  But we could potentially have another call to mark an
> IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
> would be an alternative to attaching devices.

It is one option for sure, this is where I was thinking when we were
talking in the other thread. I think the decision is best
implementation driven, as the data structure to store the
preregistration data should be rather purpose built.

> > /*
> >   * Map/unmap process virtual addresses to I/O virtual addresses.
> >   *
> >   * Provide VFIO type1 equivalent semantics. Start with the same 
> >   * restriction e.g. the unmap size should match those used in the 
> >   * original mapping call. 
> >   *
> >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> >   * must be already in the preregistered list.
> >   *
> >   * Input parameters:
> >   *	- u32 ioasid;
> >   *	- refer to vfio_iommu_type1_dma_{un}map
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> 
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Me too, or a SVA page table.

This document would do well to have a list of imagined page table
types and the set of operations that act on them. I think they are all
pretty disjoint..

Your presentation of 'kernel owns the table' vs 'userspace owns the
table' is a useful clarification to call out too

> > 5. Use Cases and Flows
> > 
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> > 
> > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> 
> Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
> for mdevs.  That leaves other subdirs of /dev/vfio free for future
> non-PCI device types, and /dev/vfio itself for the legacy group
> devices.

There are a bunch of nice options here if we go this path

> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> > 
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid.
> 
> Doesn't really affect your example, but note that the PAPR IOMMU does
> not have a passthrough mode, so devices will not initially be attached
> to gpa_ioasid - they will be unusable for DMA until attached to a
> gIOVA ioasid.

I think attachment should always be explicit in the API. If the user
doesn't explicitly ask for a device to be attached to the IOASID then
the iommu driver is free to block it.

If you want passthrough then you have to create a passthrough IOASID
and attach every device to it. Some of those attaches might be NOP's
due to groups.

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  8:54                         ` Jason Wang
@ 2021-06-02 17:21                           ` Jason Gunthorpe
  2021-06-07 13:30                             ` Enrico Weigelt, metux IT consult
  2021-06-08  1:10                             ` Jason Wang
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 17:21 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L, kvm, Jonathan Corbet, iommu,
	LKML, Alex Williamson (alex.williamson@redhat.com),
	David Woodhouse

On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:
> 
> On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
> > > We can open up to ~0U file descriptors, I don't see why we need to restrict
> > > it in uAPI.
> > There are significant problems with such large file descriptor
> > tables. High FD numbers man things like select don't work at all
> > anymore and IIRC there are more complications.
> 
> 
> I don't see how much difference for IOASID and other type of fds. People can
> choose to use poll or epoll.

Not really, once one thing in an application uses a large number of FDs
the entire application is affected. If any open() can return 'very big
number' then nothing in the process is allowed to ever use select.

It is not a trivial thing to ask for.
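
The concrete problem is select()'s fd_set, which is a fixed bitmap of
FD_SETSIZE (normally 1024) bits:

	#include <sys/select.h>

	fd_set rfds;
	FD_ZERO(&rfds);
	/* FD_SET() on an fd >= FD_SETSIZE is out of bounds, so once any
	 * fd in the process is that large, select() is off the table for
	 * the whole application */
	if (fd < FD_SETSIZE)
		FD_SET(fd, &rfds);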

> And with the current proposal, (assuming there's a N:1 ioasid to ioasid). I
> wonder how select can work for the specific ioasid.

Page fault events are one thing that comes to mind. Bundling them all
together into a single ring buffer is going to be necessary. Multiple
fds just complicate this too.
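
e.g. something along these lines for the ring entries, so userspace can
demultiplex per IOASID/device without one fd per IOASID (the layout is
illustrative only, not part of the proposal):

	struct ioasid_fault_event {
		__u32	ioasid;		/* which address space faulted */
		__u32	pasid;		/* valid when the IOASID spans PASIDs */
		__u64	dev_cookie;	/* user-supplied device label */
		__u64	addr;		/* faulting IOVA */
		__u32	perm;		/* requested access: read/write/exec */
		__u32	flags;
	};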

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  8:56 ` Enrico Weigelt, metux IT consult
@ 2021-06-02 17:24   ` Jason Gunthorpe
  2021-06-04 10:44     ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 17:24 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 10:56:48AM +0200, Enrico Weigelt, metux IT consult wrote:

> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
> to other APIs / devices. Why can't the fd be acquired by the
> consumer APIs (eg. kvm, vfio, etc) ?

/dev/ioasid would be similar to /dev/vfio, and everything already
deals with exposing /dev/vfio and /dev/vfio/N together

I don't see it as a problem, just more work.

Having FDs spawn other FDs is pretty ugly; it defeats the "everything
is a file" model of UNIX.

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 17:11           ` Alex Williamson
@ 2021-06-02 17:35             ` Jason Gunthorpe
  2021-06-02 18:01               ` Alex Williamson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 17:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:

> > > > present and be able to test if DMA for that device is cache
> > > > coherent.  
> > 
> > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > emulate wbinvd' flag from qemu?
> 
> IIRC, wbinvd has host implications, a malicious user could tell KVM to
> emulate wbinvd then run the op in a loop and induce a disproportionate
> load on the system.  We therefore wanted a way that it would only be
> enabled when required.

I think the non-coherentness is vfio_device specific? eg a specific
device will decide if it is coherent or not?

If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
implementation and not link it through the IOMMU.

If userspace is telling the vfio_device to be non-coherent or not then
it can call kvm_arch_register_noncoherent_dma() or not based on that
signal.
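
i.e. roughly the following (a sketch; the "can use no-snoop" predicate
stands in for whatever knowledge the driver has about its own device):

	/* let the vfio_device driver, which knows whether its device may
	 * issue no-snoop TLPs, drive the KVM wbinvd enablement directly */
	static void my_vfio_driver_notify_kvm(struct vfio_device *vdev,
					      struct kvm *kvm)
	{
		/* driver-private bookkeeping of the kvm pointer elided */
		if (kvm && my_device_can_use_nosnoop(vdev))
			kvm_arch_register_noncoherent_dma(kvm);
	}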

> > It kind of looks like the other main point is to generate the
> > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> > connect back to the kvm data
> > 
> > But that seems like it would have been better handled with some IOCTL
> > on the vfio_device fd to import the KVM to the driver not this
> > roundabout way?
> 
> Then QEMU would need to know which drivers require KVM knowledge?  This
> allowed transparent backwards compatibility with userspace.  Thanks,

I'd just blindly fire a generic 'hey here is your KVM FD' into every
VFIO device.

The backwards compat angle is reasonable enough though.

So those two don't sound so bad, don't know about PPC, but David seems
optimistic.

A basic idea is to remove the iommu stuff from the kvm connection so
that the scope of the iommu related rework is contained to vfio

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 17:35             ` Jason Gunthorpe
@ 2021-06-02 18:01               ` Alex Williamson
  2021-06-02 18:09                 ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-02 18:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, 2 Jun 2021 14:35:10 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> 
> > > > > present and be able to test if DMA for that device is cache
> > > > > coherent.    
> > > 
> > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > emulate wbinvd' flag from qemu?  
> > 
> > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > emulate wbinvd then run the op in a loop and induce a disproportionate
> > load on the system.  We therefore wanted a way that it would only be
> > enabled when required.  
> 
> I think the non-coherentness is vfio_device specific? eg a specific
> device will decide if it is coherent or not?

No, this is specifically whether DMA is cache coherent to the
processor, ie. in the case of wbinvd whether the processor needs to
invalidate its cache in order to see data from DMA.

> If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> implementation and not link it through the IOMMU.

The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
IOMMU_CAP_CACHE_COHERENCY for all domains within a container.

> If userspace is telling the vfio_device to be non-coherent or not then
> it can call kvm_arch_register_noncoherent_dma() or not based on that
> signal.

Not non-coherent device memory, that would be a driver issue, cache
coherence of DMA is what we're after.

> > > It kind of looks like the other main point is to generate the
> > > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> > > connect back to the kvm data
> > > 
> > > But that seems like it would have been better handled with some IOCTL
> > > on the vfio_device fd to import the KVM to the driver not this
> > > roundabout way?  
> > 
> > Then QEMU would need to know which drivers require KVM knowledge?  This
> > allowed transparent backwards compatibility with userspace.  Thanks,  
> 
> I'd just blindly fire a generic 'hey here is your KVM FD' into every
> VFIO device.

Yes, QEMU could do this, but the vfio-kvm device was already there with
this association and required no uAPI work.  This one is the least IOMMU
related of the use cases.  Thanks,

Alex



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 18:01               ` Alex Williamson
@ 2021-06-02 18:09                 ` Jason Gunthorpe
  2021-06-02 19:00                   ` Alex Williamson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 18:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 14:35:10 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > 
> > > > > > present and be able to test if DMA for that device is cache
> > > > > > coherent.    
> > > > 
> > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > emulate wbinvd' flag from qemu?  
> > > 
> > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > load on the system.  We therefore wanted a way that it would only be
> > > enabled when required.  
> > 
> > I think the non-coherentness is vfio_device specific? eg a specific
> > device will decide if it is coherent or not?
> 
> No, this is specifically whether DMA is cache coherent to the
> processor, ie. in the case of wbinvd whether the processor needs to
> invalidate its cache in order to see data from DMA.

I'm confused. This is x86, all DMA is cache coherent unless the device
is doing something special.

> > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > implementation and not link it through the IOMMU.
> 
> The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> IOMMU_CAP_CACHE_COHERENCY for all domains within a container.

And this special IOMMU mode is basically requested by the device
driver, right? Because if you use this mode you have to also use
special programming techniques.

This smells like all the "snoop bypass" stuff from PCIE (for GPUs
even) in a different guise - it is device triggered, not platform
triggered behavior.

Jason


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 18:09                 ` Jason Gunthorpe
@ 2021-06-02 19:00                   ` Alex Williamson
  2021-06-02 19:54                     ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-02 19:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, 2 Jun 2021 15:09:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> > On Wed, 2 Jun 2021 14:35:10 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > >   
> > > > > > > present and be able to test if DMA for that device is cache
> > > > > > > coherent.      
> > > > > 
> > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > > emulate wbinvd' flag from qemu?    
> > > > 
> > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > > load on the system.  We therefore wanted a way that it would only be
> > > > enabled when required.    
> > > 
> > > I think the non-coherentness is vfio_device specific? eg a specific
> > > device will decide if it is coherent or not?  
> > 
> > No, this is specifically whether DMA is cache coherent to the
> > processor, ie. in the case of wbinvd whether the processor needs to
> > invalidate its cache in order to see data from DMA.  
> 
> I'm confused. This is x86, all DMA is cache coherent unless the device
> is doing something special.
> 
> > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > > implementation and not link it through the IOMMU.  
> > 
> > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> > IOMMU_CAP_CACHE_COHERENCY for all domains within a container.  
> 
> And this special IOMMU mode is basically requested by the device
> driver, right? Because if you use this mode you have to also use
> special programming techniques.
> 
> This smells like all the "snoop bypass" stuff from PCIE (for GPUs
> even) in a different guise - it is device triggered, not platform
> triggered behavior.

Right, the device can generate the no-snoop transactions, but it's the
IOMMU that essentially determines whether those transactions are
actually still cache coherent, AIUI.

I did experiment with virtually hardwiring the Enable No-Snoop bit in
the Device Control Register to zero, which would be generically allowed
by the PCIe spec, but then we get into subtle dependencies in the device
drivers and clearing the bit again after any sort of reset and the
backdoor accesses to config space which exist mostly in the class of
devices that might use no-snoop transactions (yes, GPUs suck).
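
(For reference, "hardwiring Enable No-snoop to zero" boils down to
keeping this Device Control bit clear, and re-clearing it after every
reset and after any backdoor config access:)

	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
				   PCI_EXP_DEVCTL_NOSNOOP_EN);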

It was much easier and more robust to ignore the device setting and rely
on the IOMMU behavior.  Yes, maybe we sometimes emulate wbinvd for VMs
where the device doesn't support no-snoop, but it seemed like platforms
were headed in this direction where no-snoop was ignored anyway.
Thanks,

Alex



* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 19:00                   ` Alex Williamson
@ 2021-06-02 19:54                     ` Jason Gunthorpe
  2021-06-02 20:37                       ` Alex Williamson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 19:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 15:09:25 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> > > On Wed, 2 Jun 2021 14:35:10 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > > >   
> > > > > > > > present and be able to test if DMA for that device is cache
> > > > > > > > coherent.      
> > > > > > 
> > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > > > emulate wbinvd' flag from qemu?    
> > > > > 
> > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > > > load on the system.  We therefore wanted a way that it would only be
> > > > > enabled when required.    
> > > > 
> > > > I think the non-coherentness is vfio_device specific? eg a specific
> > > > device will decide if it is coherent or not?  
> > > 
> > > No, this is specifically whether DMA is cache coherent to the
> > > processor, ie. in the case of wbinvd whether the processor needs to
> > > invalidate its cache in order to see data from DMA.  
> > 
> > I'm confused. This is x86, all DMA is cache coherent unless the device
> > is doing something special.
> > 
> > > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > > > implementation and not link it through the IOMMU.  
> > > 
> > > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> > > IOMMU_CAP_CACHE_COHERENCY for all domains within a container.  
> > 
> > And this special IOMMU mode is basically requested by the device
> > driver, right? Because if you use this mode you have to also use
> > special programming techniques.
> > 
> > This smells like all the "snoop bypass" stuff from PCIE (for GPUs
> > even) in a different guise - it is device triggered, not platform
> > triggered behavior.
> 
> Right, the device can generate the no-snoop transactions, but it's the
> IOMMU that essentially determines whether those transactions are
> actually still cache coherent, AIUI.

Wow, this is really confusing stuff in the code.

At the PCI level there is a TLP bit called no-snoop that is platform
specific. The general intention is to allow devices to selectively
bypass the CPU caching for DMAs. GPUs like to use this feature for
performance.

I assume there are some exciting security issues here. Looks like
allowing cache bypass does something bad inside VMs? Looks like
allowing the VM to use the cache clear instruction that is mandatory
with cache bypass DMA causes some QOS issues? OK.

So how does it work?

What I see in the intel/iommu.c is that some domains support "snoop
control" or not, based on some HW flag. This indicates if the
DMA_PTE_SNP bit is supported on a page by page basis or not.

Since x86 always leans toward "DMA cache coherent" I'm reading some
tea leaves here:

	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
					   transactions */

And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

Further, I guess IOMMUs that don't support PTE_SNP, or have
DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
and all)

So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
incoherent with the CPU caches, it just means that that snoop bit in
the TLP cannot be enforced. ie the device *could* do no-snoop DMA
if it wants. Devices that never do no-snoop remain DMA coherent on
x86, as they always have been.

IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
the PCI device is blocked from using no-snoop in its TLPs.

I wonder if ARM implemented this consistently? I see VDPA is
confused.. I was confused. What a terrible set of names.

In VFIO generic code I see it always sets IOMMU_CACHE:

        if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
                domain->prot |= IOMMU_CACHE;

And thus also always provides IOMMU_CACHE to iommu_map:

                ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
                                npage << PAGE_SHIFT, prot | d->prot);

So when the IOMMU supports the no-snoop blocking security feature VFIO
turns it on and blocks no-snoop to all pages? Ok..

But I must be missing something big because *something* in the IOVA
map should work with no-snoopable DMA, right? Otherwise what is the
point of exposing the invalidate instruction to the guest?

I would think userspace should be relaying the DMA_PTE_SNP bit from
the guest's page tables up to here??

The KVM hookup is driven by IOMMU_CACHE which is driven by
IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only
if the IOMMU can block the SNP bit? And then we map all the pages to
block the snoop bit? Huh?

Your explanation makes perfect sense: Block guests from using the
dangerous cache invalidate instruction unless a device that uses
no-snoop is plugged in. Block devices from using no-snoop because
something about it is insecure. Ok.

But the conditions I'm looking for "device that uses no-snoop" is:
 - The device will issue no-snoop TLPs at all
 - The IOMMU will let no-snoop through
 - The platform will honor no-snoop

Only if all three are met we should allow the dangerous instruction in
KVM, right?

Which brings me back to my original point - this is at least partially
a device specific behavior. It depends on the content of the IOMMU
page table, it depends if the device even supports no-snoop at all.

My guess is this works correctly for the mdev Intel kvmgt which
probably somehow allows no-snoop DMA through the mdev SW iommu
mappings. (assuming I didn't miss a tricky iommu_map without
IOMMU_CACHE set in the type1 code?)

But why is vfio-pci using it? Hmm?

Confused,
Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 19:54                     ` Jason Gunthorpe
@ 2021-06-02 20:37                       ` Alex Williamson
  2021-06-02 22:45                         ` Jason Gunthorpe
  2021-06-03  2:52                         ` Jason Wang
  0 siblings, 2 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-02 20:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, 2 Jun 2021 16:54:04 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
> > 
> > Right, the device can generate the no-snoop transactions, but it's the
> > IOMMU that essentially determines whether those transactions are
> > actually still cache coherent, AIUI.  
> 
> Wow, this is really confusing stuff in the code.
> 
> At the PCI level there is a TLP bit called no-snoop that is platform
> specific. The general intention is to allow devices to selectively
> bypass the CPU caching for DMAs. GPUs like to use this feature for
> performance.

Yes

> I assume there is some exciting security issues here. Looks like
> allowing cache bypass does something bad inside VMs? Looks like
> allowing the VM to use the cache clear instruction that is mandatory
> with cache bypass DMA causes some QOS issues? OK.

IIRC, largely a DoS issue if userspace gets to choose when to emulate
wbinvd rather than it being demanded for correct operation.

> So how does it work?
> 
> What I see in the intel/iommu.c is that some domains support "snoop
> control" or not, based on some HW flag. This indicates if the
> DMA_PTE_SNP bit is supported on a page by page basis or not.
> 
> Since x86 always leans toward "DMA cache coherent" I'm reading some
> tea leaves here:
> 
> 	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
> 					   transactions */
> 
> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
> snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

That's my understanding as well.

> Further, I guess IOMMUs that don't support PTE_SNP, or have
> DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
> and all)

Yes.

> So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
> incoherent with the CPU caches, it just means that the snoop bit in
> the TLP cannot be enforced. ie the device *could* do no-snoop DMA
> if it wants. Devices that never do no-snoop remain DMA coherent on
> x86, as they always have been.

Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
DMA to be coherent via the IOMMU.

> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
> the PCI device is blocked from using no-snoop in its TLPs.
> 
> I wonder if ARM implemented this consistently? I see VDPA is
> confused.. I was confused. What a terrible set of names.
> 
> In VFIO generic code I see it always sets IOMMU_CACHE:
> 
>         if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
>                 domain->prot |= IOMMU_CACHE;
> 
> And thus also always provides IOMMU_CACHE to iommu_map:
> 
>                 ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
>                                 npage << PAGE_SHIFT, prot | d->prot);
> 
> So when the IOMMU supports the no-snoop blocking security feature VFIO
> turns it on and blocks no-snoop to all pages? Ok..

Yep, I'd forgotten this nuance that we need to enable it via the
mapping flags.

> But I must be missing something big because *something* in the IOVA
> map should work with no-snoopable DMA, right? Otherwise what is the
> point of exposing the invalidate instruction to the guest?
> 
> I would think userspace should be relaying the DMA_PTE_SNP bit from
> the guest's page tables up to here??
> 
> The KVM hookup is driven by IOMMU_CACHE which is driven by
> IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only
> if the IOMMU can block the SNP bit? And then we map all the pages to
> block the snoop bit? Huh?

Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
from the guest page table... what page table?  We don't necessarily
have a vIOMMU to expose such things, I don't think it even existed when
this was added.  Essentially if we can ignore no-snoop at the IOMMU,
then KVM doesn't need to worry about emulating wbinvd because of an
assigned device, whether that device uses it or not.  Win-win.

> Your explanation makes perfect sense: Block guests from using the
> dangerous cache invalidate instruction unless a device that uses
> no-snoop is plugged in. Block devices from using no-snoop because
> something about it is insecure. Ok.

No-snoop itself is not insecure, but to support no-snoop in a VM KVM
can't ignore wbinvd, which has overhead and abuse implications.

> But the conditions I'm looking for "device that uses no-snoop" is:
>  - The device will issue no-snoop TLPs at all

We can't really know this generically.  We can try to set the enable
bit to see if the device is capable of no-snoop, but that doesn't mean
it will use no-snoop.

>  - The IOMMU will let no-snoop through
>  - The platform will honor no-snoop
> 
> Only if all three are met we should allow the dangerous instruction in
> KVM, right?

We test at the IOMMU and assume that the IOMMU knowledge encompasses
whether the platform honors no-snoop (note for example how amd and arm
report true for IOMMU_CAP_CACHE_COHERENCY but seem to ignore the
IOMMU_CACHE flag).  We could probably use an iommu_group_for_each_dev
to test if any devices within the group are capable of no-snoop if the
IOMMU can't protect us, but at the time it didn't seem worthwhile.  I'm
still not sure if it is.
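
For reference, that per-group test might look roughly like the sketch
below (untested; treating the PCIe No Snoop Enable bit as the "capable
of no-snoop" signal is only a heuristic for illustration):

#include <linux/iommu.h>
#include <linux/pci.h>

static int dev_may_use_nosnoop(struct device *dev, void *data)
{
	bool *found = data;
	u16 devctl;

	if (!dev_is_pci(dev))
		return 0;

	/* No Snoop Enable set means the device is at least permitted to
	 * emit no-snoop TLPs; it says nothing about whether it will. */
	if (!pcie_capability_read_word(to_pci_dev(dev), PCI_EXP_DEVCTL, &devctl) &&
	    (devctl & PCI_EXP_DEVCTL_NOSNOOP_EN))
		*found = true;

	return 0;
}

static bool group_may_use_nosnoop(struct iommu_group *group)
{
	bool found = false;

	iommu_group_for_each_dev(group, &found, dev_may_use_nosnoop);
	return found;
}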
 
> Which brings me back to my original point - this is at least partially
> a device specific behavior. It depends on the content of the IOMMU
> page table, it depends if the device even supports no-snoop at all.
> 
> My guess is this works correctly for the mdev Intel kvmgt which
> probably somehow allows no-snoop DMA through the mdev SW iommu
> mappings. (assuming I didn't miss a tricky iommu_map without
> IOMMU_CACHE set in the type1 code?)

This support existed before mdev, IIRC we needed it for direct
assignment of NVIDIA GPUs.
 
> But why is vfio-pci using it? Hmm?

Use the IOMMU to reduce hypervisor overhead, let the hypervisor learn
about it, ignore the subtleties of whether the device actually uses
no-snoop as imprecise and poor ROI given the apparent direction of
hardware.

¯\_(ツ)_/¯,
Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 20:37                       ` Alex Williamson
@ 2021-06-02 22:45                         ` Jason Gunthorpe
  2021-06-03  2:50                           ` Alex Williamson
  2021-06-03  2:52                         ` Jason Wang
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 22:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:

> Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> from the guest page table... what page table?  

I see my confusion now, the phrasing in your earlier remark led me to
think this was about allowing the no-snoop performance enhancement in
some restricted way.

It is really about blocking no-snoop 100% of the time and then
disabling the dangerous wbinvd when the block is successful.

Didn't closely read the kvm code :\

If it was about allowing the optimization then I'd expect the guest to
enable no-snoopable regions via its vIOMMU and realize them to the
hypervisor and plumb the whole thing through. Hence my remark about
the guest page tables..

So really the test is just 'were we able to block it' ?

> This support existed before mdev, IIRC we needed it for direct
> assignment of NVIDIA GPUs.

Probably because they ignored the disable no-snoop bits in the control
block, or reset them in some insane way to "fix" broken bioses and
kept using it even though by all rights qemu would have tried hard to
turn it off via the config space. Processing no-snoop without a
working wbinvd would be fatal. Yeesh

But Ok, back to /dev/ioasid. This answers a few lingering questions I
had..

1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
   domains.

   This doesn't actually matter. If you mix them together then kvm
   will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
   anywhere in this VM.

   Thus if two IOMMUs are joined together into a single /dev/ioasid
   then we can just make them both pretend to be
   !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

2) How to fit this part of kvm in some new /dev/ioasid world

   What we want to do here is iterate over every ioasid associated
   with the group fd that is passed into kvm.

   Today the group fd has a single container which specifies the
   single ioasid so this is being done trivially.

   To reorg we want to get the ioasid from the device not the
   group (see my note to David about the groups vs device rational)

   This is just iterating over each vfio_device in the group and
   querying the ioasid it is using.

   Or perhaps more directly: an op attaching the vfio_device to the
   kvm and having some simple helper 
         '(un)register ioasid with kvm (kvm, ioasid)'
   that the vfio_device driver can call that just sorts this out.

   It is not terrible..
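
   A rough sketch of the shape of that helper (every name below is
   invented here, purely to illustrate the idea):

/* hypothetical helpers, nothing like this exists yet */
int kvm_register_ioasid(struct kvm *kvm, struct ioasid_ctx *ctx, u32 ioasid);
void kvm_unregister_ioasid(struct kvm *kvm, struct ioasid_ctx *ctx, u32 ioasid);

/* Called by the vfio_device driver once it is attached to both a kvm
 * and an ioasid; vdev->ioasid_ctx and vfio_device_ioasid() are made up
 * for illustration. */
static int vfio_device_register_ioasid(struct vfio_device *vdev,
				       struct kvm *kvm)
{
	return kvm_register_ioasid(kvm, vdev->ioasid_ctx,
				   vfio_device_ioasid(vdev));
}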

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  4:01       ` Lu Baolu
@ 2021-06-02 23:23         ` Jason Gunthorpe
  2021-06-03  5:49           ` Lu Baolu
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 23:23 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, LKML, Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:
> On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> > 
> > > This version only covers 1) and 4). Do you think we need to support 2),
> > > 3) and beyond?
> > 
> > Yes absolutely. The API should be flexible enough to specify the
> > creation of all future page table formats we'd want to have and all HW
> > specific details on those formats.
> 
> OK, stay in the same line.
> 
> > > If so, it seems that we need some in-kernel helpers and uAPIs to
> > > support pre-installing a page table to IOASID.
> > 
> > Not sure what this means..
> 
> Sorry that I didn't make this clear.
> 
> Let me bring back the page table types in my eyes.
> 
>  1) IOMMU format page table (a.k.a. iommu_domain)
>  2) user application CPU page table (SVA for example)
>  3) KVM EPT (future option)
>  4) VM guest managed page table (nesting mode)
> 
> Each type of page table should be able to be associated with its IOASID.
> We have BIND protocol for 4); We explicitly allocate an iommu_domain for
> 1). But we don't have a clear definition for 2) 3) and others. I think
> it's necessary to clearly define a time point and kAPI name between
> IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
> opportunity to associate their page table with the allocated IOASID
> before attaching the page table to the real IOMMU hardware.

In my mind these are all actions of creation..

#1 is ALLOC_IOASID 'to be compatible with these devices attached to
   this FD'
#2 is ALLOC_IOASID_SVA
#3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
#4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
      ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address

Each allocation should have a set of operations that it allows:
map/unmap is only legal on #1, invalidate is only legal on #4, etc.

How you want to split this up in the ioctl interface is a more
interesting question. I generally like more calls than giant unwieldy
multiplexer structs, but some things are naturally flags and optional
modifications of a single ioctl.

In any event they should have a similar naming 'ALLOC_IOASID_XXX' and
then a single 'DESTROY_IOASID' that works on all of them.
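
As a strawman only (numbers and exact spellings are placeholders,
reusing the IOASID_BASE convention from the RFC):

#define ALLOC_IOASID			_IO(IOASID_TYPE, IOASID_BASE + 0) /* kernel-managed, map/unmap */
#define ALLOC_IOASID_SVA		_IO(IOASID_TYPE, IOASID_BASE + 1) /* share a process CPU page table */
#define ALLOC_IOASID_KVM		_IO(IOASID_TYPE, IOASID_BASE + 2) /* maybe issued via the kvm fd */
#define ALLOC_IOASID_USER_PAGE_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 3) /* bind+invalidate, user VA */
#define ALLOC_IOASID_NESTED_PAGE_TABLE	_IO(IOASID_TYPE, IOASID_BASE + 4) /* bind+invalidate, IOVA */
/* one destroy that works on any of the above */
#define DESTROY_IOASID			_IO(IOASID_TYPE, IOASID_BASE + 5)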

> I/O page fault handling is similar. The provider of the page table
> should take the responsibility to handle the possible page faults.

For the faultable types, yes #3 and #4 should hook in the fault
handler and deal with it.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  1:25       ` Tian, Kevin
@ 2021-06-02 23:27         ` Jason Gunthorpe
  2021-06-04  8:17         ` Jean-Philippe Brucker
  1 sibling, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-02 23:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Wed, Jun 02, 2021 at 01:25:00AM +0000, Tian, Kevin wrote:

> OK, this implies that if one user inadvertently creates intended parent/
> child via different fd's then the operation will simply fail.

Remember the number space used to refer to ioasids inside the FD is
local to that instance of the FD. Each FD should have its own xarray.

You can't actually accidentally refer to an IOASID in FD A from FD B
because the xarray lookup in FD B will not return 'IOASID A'.
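
In code form the idea is just a per-fd xarray (sketch only, the struct
names are invented):

#include <linux/xarray.h>

struct ioasid_ctx {			/* one per open /dev/ioasid fd */
	struct xarray ioasid_xa;	/* user handle -> internal object */
};

static void *ioasid_lookup(struct ioasid_ctx *ctx, u32 user_handle)
{
	/* strictly local to this fd; a handle allocated by another fd
	 * simply resolves to NULL here */
	return xa_load(&ctx->ioasid_xa, user_handle);
}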

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:09         ` Jason Gunthorpe
@ 2021-06-03  1:29           ` Tian, Kevin
  2021-06-03  5:09             ` David Gibson
  0 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  1:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 12:09 AM
> 
> On Wed, Jun 02, 2021 at 01:33:22AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, June 2, 2021 1:42 AM
> > >
> > > On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > >
> > > > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > IOASID nesting can be implemented in two ways: hardware nesting
> and
> > > > > > software nesting. With hardware support the child and parent I/O
> page
> > > > > > tables are walked consecutively by the IOMMU to form a nested
> > > translation.
> > > > > > When it's implemented in software, the ioasid driver is responsible
> for
> > > > > > merging the two-level mappings into a single-level shadow I/O page
> > > table.
> > > > > > Software nesting requires both child/parent page tables operated
> > > through
> > > > > > the dma mapping protocol, so any change in either level can be
> > > captured
> > > > > > by the kernel to update the corresponding shadow mapping.
> > > > >
> > > > > Why? A SW emulation could do this synchronization during
> invalidation
> > > > > processing if invalidation contained an IOVA range.
> > > >
> > > > In this proposal we differentiate between host-managed and user-
> > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > > not required when changing a PTE from non-present to present).
> > > >
> > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > instead of mixing them together. Following this policy, map+unmap
> > > > must be used in both levels for software nesting, so changes in either
> > > > level are captured timely to synchronize the shadow mapping.
> > >
> > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > it is created. If you put two different types in a tree then each IOASID
> > > must continue to use its own operation mode.
> > >
> > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> >
> > only for software nesting. With hardware support the parent uses map
> > while the child uses bind.
> >
> > Yes, the policy is specified per IOASID. But if the policy violates the
> > requirement in a specific nesting mode, then nesting should fail.
> 
> I don't get it.
> 
> If the IOASID is a page table then it is bind/invalidate. SW or not SW
> doesn't matter at all.
> 
> > >
> > > A software emulated two level page table where the leaf level is a
> > > bound page table in guest memory should continue to use
> > > bind/invalidate to maintain the guest page table IOASID even though it
> > > is a SW construct.
> >
> > with software nesting the leaf should be a host-managed page table
> > (or metadata). A bind/invalidate protocol doesn't require the user
> > to notify the kernel of every page table change.
> 
> The purpose of invalidate is to inform the implementation that the
> page table has changed so it can flush the caches. If the page table
> is changed and invalidation is not issued then the implementation
> is free to ignore the changes.
> 
> In this way the SW mode is the same as a HW mode with an infinite
> cache.
> 
> The collapsed shadow page table is really just a cache.
> 

OK. One additional thing is that we may need a 'caching_mode'
flag reported by /dev/ioasid, indicating whether invalidation is
required when changing non-present to present. For hardware
nesting it's not reported, as the hardware IOMMU will walk the
guest page table on iotlb miss. For software nesting
caching_mode is reported so the user must issue invalidation
upon any change in the guest page table, allowing the kernel to
update the shadow page table in a timely manner.
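
For illustration, the report could be a single flag in whatever info
structure /dev/ioasid ends up exposing (the name and layout below are
invented, not part of the proposal):

struct ioasid_iotlb_info {
	__u32	flags;
/* set for software nesting: the user must invalidate on every guest
 * page table change, including non-present -> present, so the kernel
 * can refresh the shadow page table; clear for hardware nesting where
 * the IOMMU walks the guest page table on iotlb miss */
#define IOASID_INFO_CACHING_MODE	(1 << 0)
	__u32	__reserved;
};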

Following this and your other comment with David, we will mark
host-managed vs. guest-managed explicitly for the I/O page table
of each IOASID. map+unmap or bind+invalidate is decided by
which owner is specified by the user.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:16     ` Jason Gunthorpe
@ 2021-06-03  2:11       ` Tian, Kevin
  2021-06-03  5:13       ` David Gibson
  1 sibling, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  2:11 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 12:17 AM
>
[...] 
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID.
> >
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
> 
> Maybe.. It is certainly a special case.
> 
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

More accurately 64+20=84 bit IOVA 😊

> 
> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page

I'm leaning toward this direction now, after a discussion with Baolu.
He reminded me that a default domain is already created for each
device when it's probed by the iommu driver. So it looks workable
to expose a per-device capability query uAPI to the user once a device
is bound to the ioasid fd. Once it's available, the user should be able
to judge what format/mode should be set when creating an IOASID.
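
As a very rough sketch (all names and fields below are speculative,
just to show the idea), the per-device query could look like:

struct ioasid_device_info {
	__u32	argsz;
	__u32	flags;
	__u64	dev_cookie;	/* device label from the bind step */
	__u32	pgtable_formats;/* bitmap of supported vendor formats */
	__u32	addr_width;
	__u32	pasid_bits;	/* 0 if PASID is not supported */
	__u32	__reserved;
};
/* placeholder number, following the RFC's IOASID_BASE convention */
#define IOASID_GET_DEVICE_INFO	_IO(IOASID_TYPE, IOASID_BASE + 8)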

> table format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

In concept this view is true. But when designing the uAPI we will
possibly not call it an 84-bit format, as the PASID table itself just
serves the 20-bit PASID space.

Will think more about how to mark it in the next version.

> 
> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:58     ` Jason Gunthorpe
@ 2021-06-03  2:49       ` Tian, Kevin
  2021-06-03  5:48         ` David Gibson
  2021-06-03  5:45       ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  2:49 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 12:59 AM
> 
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > 	/* Bind guest I/O page table  */
> > > > 	bind_data = {
> > > > 		.ioasid	= gva_ioasid;
> > > > 		.addr	= gva_pgtable1;
> > > > 		// and format information
> > > > 	};
> > > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > >
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> >
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually.
> 
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

I would expect the capability to be reported per-device via /dev/iommu.
Incompatible devices can bind to the same fd but cannot attach to
the same IOASID. This allows incompatible devices to share locked
page accounting.

> 
> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.

If my memory is not broken, the original concept was that devices
attached to the same IOASID must support the same format.
Otherwise they need to attach to different IOASIDs (but still
within the same fd).

> 
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

With the above concept we don't need the IOASID container then.

> 
> I don't know if that small advantage is worth the extra complexity
> though.
> 
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats.  Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept.  And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
> 
> The basic process for a user space owned pgtable mode would be:
> 
>  1) qemu has to figure out what format of pgtable to use
> 
>     Presumably it uses query functions using the device label. The
>     kernel code should look at the entire device path through all the
>     IOMMU HW to determine what is possible.
> 
>     Or it already knows because the VM's vIOMMU is running in some
>     fixed page table format, or the VM's vIOMMU already told it, or
>     something.

I'd expect both. First get the hardware format, then detect whether
it's compatible with the vIOMMU format.

> 
>  2) qemu creates an IOASID and based on #1 and says 'I want this format'

Based on earlier discussion this will possibly be:

struct iommu_ioasid_create_info {

// if set this is a guest-managed page table, use bind+invalidate, with
// info provided in struct pgtable_info;
// if clear it's host-managed and use map+unmap;
#define IOMMU_IOASID_FLAG_USER_PGTABLE		1

// if set it is for pasid table binding. same implication as USER_PGTABLE
// except it's for a different pgtable type
#define IOMMU_IOASID_FLAG_USER_PASID_TABLE	2
	int		flags;

	// Create nesting if not INVALID_IOASID
	u32		parent_ioasid;

	// additional info about the page table
	union {
		// for user-managed page table
		struct {
			u64	user_pgd;
			u32	format;
			u32	addr_width;
			// and other vendor format info
		} pgtable_info;

		// for kernel-managed page table
		struct {
			// not required on x86
			// for ppc, iirc the user wants to claim a window
			// explicitly?
		} map_info;
	};
};

then there will be no UNBIND_PGTABLE ioctl. The unbind is done 
automatically when the IOASID is freed.
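
For example (illustrative only; the format constant and the ioctl name
are placeholders), creating a nested guest-managed IOASID could then
look like:

struct iommu_ioasid_create_info info = {
	.flags		= IOMMU_IOASID_FLAG_USER_PGTABLE,
	.parent_ioasid	= gpa_ioasid,		/* nest on the GPA IOASID */
	.pgtable_info	= {
		.user_pgd	= gva_pgtable1,	/* guest page table root */
		.format		= IOASID_PGTABLE_FORMAT_XXX, /* placeholder */
		.addr_width	= 48,
	},
};

gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE, &info);	/* name TBD */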

> 
>  3) qemu binds the IOASID to the device.

let's use 'attach' for consistency. 😊 'bind' is for the ioasid fd, which must
be completed in step 0) so the format can be reported in step 1).

> 
>     If qemu gets it wrong then it just fails.
> 
>  4) For the next device qemu would have to figure out if it can re-use
>     an existing IOASID based on the required properties.
> 
> You pointed to the case of mixing vIOMMU's of different platforms. So
> it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
> page table mode v2" while running on an x86 because that is what the
> vIOMMU is wired to work with.
> 
> Presumably qemu will fall back to software emulation if this is not
> possible.
> 
> One interesting option for software emulation is to just transform the
> ARM page table format to a x86 page table format in userspace and use
> nested bind/invalidate to synchronize with the kernel. With SW nesting
> I suspect this would be much faster
> 

or just use map+unmap. It's no different from how a virtio-iommu could
work on all platforms, which by definition is not the same type as the
underlying hardware.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 22:45                         ` Jason Gunthorpe
@ 2021-06-03  2:50                           ` Alex Williamson
  2021-06-03  3:22                             ` Tian, Kevin
  2021-06-03 12:34                             ` Jason Gunthorpe
  0 siblings, 2 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-03  2:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, 2 Jun 2021 19:45:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> 
> > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > from the guest page table... what page table?    
> 
> I see my confusion now, the phrasing in your earlier remark led me
> think this was about allowing the no-snoop performance enhancement in
> some restricted way.
> 
> It is really about blocking no-snoop 100% of the time and then
> disabling the dangerous wbinvd when the block is successful.
> 
> Didn't closely read the kvm code :\
> 
> If it was about allowing the optimization then I'd expect the guest to
> enable no-snoopable regions via it's vIOMMU and realize them to the
> hypervisor and plumb the whole thing through. Hence my remark about
> the guest page tables..
> 
> So really the test is just 'were we able to block it' ?

Yup.  Do we really still consider that there's some performance benefit
to be had by enabling a device to use no-snoop?  This seems largely a
legacy thing.

> > This support existed before mdev, IIRC we needed it for direct
> > assignment of NVIDIA GPUs.  
> 
> Probably because they ignored the disable no-snoop bits in the control
> block, or reset them in some insane way to "fix" broken bioses and
> kept using it even though by all rights qemu would have tried hard to
> turn it off via the config space. Processing no-snoop without a
> working wbinvd would be fatal. Yeesh
> 
> But Ok, back the /dev/ioasid. This answers a few lingering questions I
> had..
> 
> 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
>    domains.
> 
>    This doesn't actually matter. If you mix them together then kvm
>    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
>    anywhere in this VM.
> 
>    This if two IOMMU's are joined together into a single /dev/ioasid
>    then we can just make them both pretend to be
>    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
need to emulate wbinvd, but no, we'll use IOMMU_CACHE any time it's
available based on the per-domain support.  That gives us the most
consistent behavior, ie. we don't have VMs emulating wbinvd because
they used to have a device attached whose domain required it; we can't
atomically remap with new flags to perform the same as a VM that never
had that device attached in the first place.

> 2) How to fit this part of kvm in some new /dev/ioasid world
> 
>    What we want to do here is iterate over every ioasid associated
>    with the group fd that is passed into kvm.

Yeah, we need some better names, binding a device to an ioasid (fd) but
then attaching a device to an allocated ioasid (non-fd)... I assume
you're talking about the latter ioasid.

>    Today the group fd has a single container which specifies the
>    single ioasid so this is being done trivially.
> 
>    To reorg we want to get the ioasid from the device not the
>    group (see my note to David about the groups vs device rational)
> 
>    This is just iterating over each vfio_device in the group and
>    querying the ioasid it is using.

The IOMMU API group interface is largely iommu_group_for_each_dev()
anyway; we still need to account for all the RIDs and aliases of a
group.

>    Or perhaps more directly: an op attaching the vfio_device to the
>    kvm and having some simple helper 
>          '(un)register ioasid with kvm (kvm, ioasid)'
>    that the vfio_device driver can call that just sorts this out.

We could almost eliminate the device notion altogether here, use an
ioasidfd_for_each_ioasid() but we really want a way to trigger on each
change to the composition of the device set for the ioasid, which is
why we currently do it on addition or removal of a group, where the
group has a consistent set of IOMMU properties.  Register a notifier
callback via the ioasidfd?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 20:37                       ` Alex Williamson
  2021-06-02 22:45                         ` Jason Gunthorpe
@ 2021-06-03  2:52                         ` Jason Wang
  2021-06-03 13:09                           ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-03  2:52 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse


On 2021/6/3 4:37 AM, Alex Williamson wrote:
> On Wed, 2 Jun 2021 16:54:04 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
>>> Right, the device can generate the no-snoop transactions, but it's the
>>> IOMMU that essentially determines whether those transactions are
>>> actually still cache coherent, AIUI.
>> Wow, this is really confusing stuff in the code.
>>
>> At the PCI level there is a TLP bit called no-snoop that is platform
>> specific. The general intention is to allow devices to selectively
>> bypass the CPU caching for DMAs. GPUs like to use this feature for
>> performance.
> Yes
>
>> I assume there is some exciting security issues here. Looks like
>> allowing cache bypass does something bad inside VMs? Looks like
>> allowing the VM to use the cache clear instruction that is mandatory
>> with cache bypass DMA causes some QOS issues? OK.
> IIRC, largely a DoS issue if userspace gets to choose when to emulate
> wbinvd rather than it being demanded for correct operation.
>
>> So how does it work?
>>
>> What I see in the intel/iommu.c is that some domains support "snoop
>> control" or not, based on some HW flag. This indicates if the
>> DMA_PTE_SNP bit is supported on a page by page basis or not.
>>
>> Since x86 always leans toward "DMA cache coherent" I'm reading some
>> tea leaves here:
>>
>> 	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
>> 					   transactions */
>>
>> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
>> snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?
> That's my understanding as well.
>
>> Further, I guess IOMMUs that don't support PTE_SNP, or have
>> DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
>> and all)
> Yes.
>
>> So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
>> incoherent with the CPU caches, it just means that the snoop bit in
>> the TLP cannot be enforced. ie the device *could* do no-snoop DMA
>> if it wants. Devices that never do no-snoop remain DMA coherent on
>> x86, as they always have been.
> Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
> DMA to be coherent via the IOMMU.
>
>> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
>> the PCI device is blocked from using no-snoop in its TLPs.
>>
>> I wonder if ARM implemented this consistently? I see VDPA is
>> confused..


Basically, we don't want to bother with a pseudo KVM device like what VFIO
did. So for simplicity, we rule out IOMMUs that can't enforce
coherency in vhost-vDPA if the parent purely depends on the platform IOMMU:


         if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
                 return -ENOTSUPP;

For parents that use their own translation logic, an implicit
assumption is that the hardware will always perform cache coherent DMA.
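
For completeness, a simplified sketch of where that check sits,
modelled loosely on the vhost-vDPA domain allocation path (abbreviated
and possibly not matching the actual driver exactly):

static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
{
	struct device *dma_dev = vdpa_get_dma_dev(v->vdpa);
	struct bus_type *bus = dma_dev->bus;

	if (!bus)
		return -EFAULT;

	/* refuse to proceed if the platform IOMMU cannot enforce
	 * coherency, instead of growing a wbinvd-style KVM hook */
	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		return -ENOTSUPP;

	v->domain = iommu_domain_alloc(bus);
	if (!v->domain)
		return -EIO;

	if (iommu_attach_device(v->domain, dma_dev)) {
		iommu_domain_free(v->domain);
		v->domain = NULL;
		return -EIO;
	}

	return 0;
}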

Thanks



^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 17:19   ` Jason Gunthorpe
@ 2021-06-03  3:02     ` Tian, Kevin
  2021-06-03  6:26     ` David Gibson
  1 sibling, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  3:02 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 1:20 AM
> 
[...]
> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations.  e.g.
> >
> > 	'prereg' IOAS
> > 	|
> > 	\- 'rid' IOAS
> > 	   |
> > 	   \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices.  But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
> > would be an alternative to attaching devices.
> 
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregsitration data should be rather purpose built.

Yes. For now I prefer managing prereg through a separate cmd
instead of special-casing it in the IOASID graph. Anyway this is sort
of a per-fd thing.

> 
> > > /*
> > >   * Map/unmap process virtual addresses to I/O virtual addresses.
> > >   *
> > >   * Provide VFIO type1 equivalent semantics. Start with the same
> > >   * restriction e.g. the unmap size should match those used in the
> > >   * original mapping call.
> > >   *
> > >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >   * must be already in the preregistered list.
> > >   *
> > >   * Input parameters:
> > >   *	- u32 ioasid;
> > >   *	- refer to vfio_iommu_type1_dma_{un}map
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
> 
> Me too, or a SVA page table.
> 
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..
> 
> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too

sure, I incorporated this comment in my last reply.

> 
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
> > for mdevs.  That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
> 
> There are a bunch of nice options here if we go this path

Yes, this part is only roughly sketched so we can focus on /dev/iommu first.
In later versions it will be considered more seriously.

> 
> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.

'initially' here still refers to a user-requested action. For PAPR you should
do the attach only when it's necessary.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  2:50                           ` Alex Williamson
@ 2021-06-03  3:22                             ` Tian, Kevin
  2021-06-03  4:14                               ` Alex Williamson
  2021-06-03 12:40                               ` Jason Gunthorpe
  2021-06-03 12:34                             ` Jason Gunthorpe
  1 sibling, 2 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  3:22 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, June 3, 2021 10:51 AM
> 
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> >
> > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?
> >
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> >
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> >
> > Didn't closely read the kvm code :\
> >
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> >
> > So really the test is just 'were we able to block it' ?
> 
> Yup.  Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop?  This seems largely a
> legacy thing.

Yes, there is indeed a performance benefit for a device to use no-snoop,
e.g. 8K display and some image processing paths, etc. The problem is
that the IOMMU for such devices is typically a different one from the
default IOMMU for most devices. This special IOMMU may not have
the ability to enforce snoop on no-snoop PCI traffic, and then this fact
must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
for such devices to work correctly.

> 
> > > This support existed before mdev, IIRC we needed it for direct
> > > assignment of NVIDIA GPUs.
> >
> > Probably because they ignored the disable no-snoop bits in the control
> > block, or reset them in some insane way to "fix" broken bioses and
> > kept using it even though by all rights qemu would have tried hard to
> > turn it off via the config space. Processing no-snoop without a
> > working wbinvd would be fatal. Yeesh
> >
> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> >
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY
> and !IOMMU_CAP_CACHE_COHERENCY
> >    domains.
> >
> >    This doesn't actually matter. If you mix them together then kvm
> >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> >    anywhere in this VM.
> >
> >    This if two IOMMU's are joined together into a single /dev/ioasid
> >    then we can just make them both pretend to be
> >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> 
> Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then
> we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available.  That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.
> 
> > 2) How to fit this part of kvm in some new /dev/ioasid world
> >
> >    What we want to do here is iterate over every ioasid associated
> >    with the group fd that is passed into kvm.
> 
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.
> 
> >    Today the group fd has a single container which specifies the
> >    single ioasid so this is being done trivially.
> >
> >    To reorg we want to get the ioasid from the device not the
> >    group (see my note to David about the groups vs device rational)
> >
> >    This is just iterating over each vfio_device in the group and
> >    querying the ioasid it is using.
> 
> The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> anyway, we still need to account for all the RIDs and aliases of a
> group.
> 
> >    Or perhaps more directly: an op attaching the vfio_device to the
> >    kvm and having some simple helper
> >          '(un)register ioasid with kvm (kvm, ioasid)'
> >    that the vfio_device driver can call that just sorts this out.
> 
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties.  Register a notifier
> callback via the ioasidfd?  Thanks,
> 

When discussing I/O page fault support in another thread, the consensus
is that a device handle will be registered (by the user) or allocated (returned
to the user) in /dev/ioasid when binding the device to the ioasid fd. From this
angle we can register {ioasid_fd, device_handle} with KVM and then call
something like ioasidfd_device_is_coherent() to get the property.
Anyway the coherency is a per-device property which is not changed
by how many I/O page tables are attached to it.
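
A hypothetical shape of that interface (the ioasidfd helper and the
caller below are made up; only the final KVM helper mirrors the one
used by today's group-based logic):

/* query the per-device coherency property via the ioasid fd */
bool ioasidfd_device_is_coherent(struct file *ioasid_filp, u64 dev_handle);

/* called when a {ioasid_fd, device_handle} pair is registered to KVM */
static void kvm_ioasid_device_added(struct kvm *kvm,
				    struct file *ioasid_filp, u64 dev_handle)
{
	/* a device whose DMA cannot be forced coherent means the guest
	 * needs a working wbinvd */
	if (!ioasidfd_device_is_coherent(ioasid_filp, dev_handle))
		kvm_arch_register_noncoherent_dma(kvm);
}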

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  3:22                             ` Tian, Kevin
@ 2021-06-03  4:14                               ` Alex Williamson
  2021-06-03  5:18                                 ` Tian, Kevin
  2021-06-03 12:40                               ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-03  4:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, 3 Jun 2021 03:22:27 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, June 3, 2021 10:51 AM
> > 
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >  
> > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?  
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?  
> > 
> > Yup.  Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop?  This seems largely a
> > legacy thing.  
> 
> Yes, there is indeed performance benefit for device to use no-snoop,
> e.g. 8K display and some imaging processing path, etc. The problem is
> that the IOMMU for such devices is typically a different one from the
> default IOMMU for most devices. This special IOMMU may not have
> the ability of enforcing snoop on no-snoop PCI traffic then this fact
> must be understood by KVM to do proper mtrr/pat/wbinvd virtualization 
> for such devices to work correctly.

The case where the IOMMU does not support snoop-control for such a
device already works fine; we can't prevent no-snoop so KVM will
emulate wbinvd.  The harder one is whether we should opt to allow
no-snoop even if the IOMMU does support snoop-control.
 
> > > > This support existed before mdev, IIRC we needed it for direct
> > > > assignment of NVIDIA GPUs.  
> > >
> > > Probably because they ignored the disable no-snoop bits in the control
> > > block, or reset them in some insane way to "fix" broken bioses and
> > > kept using it even though by all rights qemu would have tried hard to
> > > turn it off via the config space. Processing no-snoop without a
> > > working wbinvd would be fatal. Yeesh
> > >
> > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > had..
> > >
> > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY  
> > and !IOMMU_CAP_CACHE_COHERENCY  
> > >    domains.
> > >
> > >    This doesn't actually matter. If you mix them together then kvm
> > >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > >    anywhere in this VM.
> > >
> > >    This if two IOMMU's are joined together into a single /dev/ioasid
> > >    then we can just make them both pretend to be
> > >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.  
> > 
> > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then
> > we
> > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > available based on the per domain support available.  That gives us the
> > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > because they used to have a device attached where the domain required
> > it and we can't atomically remap with new flags to perform the same as
> > a VM that never had that device attached in the first place.
> >   
> > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > >
> > >    What we want to do here is iterate over every ioasid associated
> > >    with the group fd that is passed into kvm.  
> > 
> > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > then attaching a device to an allocated ioasid (non-fd)... I assume
> > you're talking about the latter ioasid.
> >   
> > >    Today the group fd has a single container which specifies the
> > >    single ioasid so this is being done trivially.
> > >
> > >    To reorg we want to get the ioasid from the device not the
> > >    group (see my note to David about the groups vs device rational)
> > >
> > >    This is just iterating over each vfio_device in the group and
> > >    querying the ioasid it is using.  
> > 
> > The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> > anyway, we still need to account for all the RIDs and aliases of a
> > group.
> >   
> > >    Or perhaps more directly: an op attaching the vfio_device to the
> > >    kvm and having some simple helper
> > >          '(un)register ioasid with kvm (kvm, ioasid)'
> > >    that the vfio_device driver can call that just sorts this out.  
> > 
> > We could almost eliminate the device notion altogether here, use an
> > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > change to the composition of the device set for the ioasid, which is
> > why we currently do it on addition or removal of a group, where the
> > group has a consistent set of IOMMU properties.  Register a notifier
> > callback via the ioasidfd?  Thanks,
> >   
> 
> When discussing I/O page fault support in another thread, the consensus
> is that an device handle will be registered (by user) or allocated (return
> to user) in /dev/ioasid when binding the device to ioasid fd. From this 
> angle we can register {ioasid_fd, device_handle} to KVM and then call 
> something like ioasidfd_device_is_coherent() to get the property. 
> Anyway the coherency is a per-device property which is not changed 
> by how many I/O page tables are attached to it.

The mechanics are different, but this is pretty similar in concept to
KVM learning coherence using the groupfd today.  Do we want to
compromise on kernel control of wbinvd emulation to allow userspace to
make such decisions?  Ownership of a device might be reason enough to
allow the user that privilege.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  1:29           ` Tian, Kevin
@ 2021-06-03  5:09             ` David Gibson
  2021-06-03  6:49               ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-03  5:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, Jiang, Dave, David Woodhouse, Jason Wang

On Thu, Jun 03, 2021 at 01:29:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 12:09 AM
> > 
> > On Wed, Jun 02, 2021 at 01:33:22AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, June 2, 2021 1:42 AM
> > > >
> > > > On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > > >
> > > > > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > > >
> > > > > > > IOASID nesting can be implemented in two ways: hardware nesting
> > and
> > > > > > > software nesting. With hardware support the child and parent I/O
> > page
> > > > > > > tables are walked consecutively by the IOMMU to form a nested
> > > > translation.
> > > > > > > When it's implemented in software, the ioasid driver is responsible
> > for
> > > > > > > merging the two-level mappings into a single-level shadow I/O page
> > > > table.
> > > > > > > Software nesting requires both child/parent page tables operated
> > > > through
> > > > > > > the dma mapping protocol, so any change in either level can be
> > > > captured
> > > > > > > by the kernel to update the corresponding shadow mapping.
> > > > > >
> > > > > > Why? A SW emulation could do this synchronization during
> > invalidation
> > > > > > processing if invalidation contained an IOVA range.
> > > > >
> > > > > In this proposal we differentiate between host-managed and user-
> > > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > > > not required when changing a PTE from non-present to present).
> > > > >
> > > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > > instead of mixing them together. Following this policy, map+unmap
> > > > > must be used in both levels for software nesting, so changes in either
> > > > > level are captured timely to synchronize the shadow mapping.
> > > >
> > > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > > it is created. If you put two different types in a tree then each IOASID
> > > > must continue to use its own operation mode.
> > > >
> > > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> > >
> > > only for software nesting. With hardware support the parent uses map
> > > while the child uses bind.
> > >
> > > Yes, the policy is specified per IOASID. But if the policy violates the
> > > requirement in a specific nesting mode, then nesting should fail.
> > 
> > I don't get it.
> > 
> > If the IOASID is a page table then it is bind/invalidate. SW or not SW
> > doesn't matter at all.
> > 
> > > >
> > > > A software emulated two level page table where the leaf level is a
> > > > bound page table in guest memory should continue to use
> > > > bind/invalidate to maintain the guest page table IOASID even though it
> > > > is a SW construct.
> > >
> > > with software nesting the leaf should be a host-managed page table
> > > (or metadata). A bind/invalidate protocol doesn't require the user
> > > to notify the kernel of every page table change.
> > 
> > The purpose of invalidate is to inform the implementation that the
> > page table has changed so it can flush the caches. If the page table
> > is changed and invalidation is not issued then the implementation
> > is free to ignore the changes.
> > 
> > In this way the SW mode is the same as a HW mode with an infinite
> > cache.
> > 
> > The collapsed shadow page table is really just a cache.
> > 
> 
> OK. One additional thing is that we may need a "caching_mode"
> flag reported by /dev/ioasid, indicating whether invalidation is
> required when changing a PTE from non-present to present. For
> hardware nesting it's not reported, as the hardware IOMMU will walk
> the guest page table on an iotlb miss. For software nesting
> caching_mode is reported, so the user must issue invalidation
> upon any change in the guest page table and the kernel can update
> the shadow page table in time.

For the first cut, I'd have the API assume that invalidates are
*always* required.  Some bypass to avoid them in cases where they're
not needed can be an additional extension.
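
If/when that extension is added, here is a minimal sketch of how it might
be surfaced to userspace.  IOASID_GET_INFO is taken from the proposal, but
the flag name, struct layout and usage below are purely illustrative
assumptions:

	struct ioasid_info {
		__u32	ioasid;
		__u32	flags;
	/* illustrative: set when NP->P changes also require an invalidate */
	#define IOASID_INFO_FLAG_CACHING_MODE	(1 << 0)
	};

	struct ioasid_info info = { .ioasid = gva_ioasid };

	ioctl(ioasid_fd, IOASID_GET_INFO, &info);
	/* software nesting: every guest PTE change needs an invalidate */
	bool need_invalidate = info.flags & IOASID_INFO_FLAG_CACHING_MODE;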

> Following this and your other comment with David, we will mark
> host-managed vs. guest-managed explicitly for the I/O page table
> of each IOASID. map+unmap or bind+invalidate is decided by
> which owner is specified by the user.
> 
> Thanks
> Kevin
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:16     ` Jason Gunthorpe
  2021-06-03  2:11       ` Tian, Kevin
@ 2021-06-03  5:13       ` David Gibson
  2021-06-03 11:52         ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-03  5:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Wed, Jun 02, 2021 at 01:16:48PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > > I agree with Jean-Philippe - at the very least erasing this
> > > information needs a major rational - but I don't really see why it
> > > must be erased? The HW reports the originating device, is it just a
> > > matter of labeling the devices attached to the /dev/ioasid FD so it
> > > can be reported to userspace?
> > 
> > HW reports the originating device as far as it knows.  In many cases
> > where you have multiple devices in an IOMMU group, it's because
> > although they're treated as separate devices at the kernel level, they
> > have the same RID at the HW level.  Which means a RID for something in
> > the right group is the closest you can count on supplying.
> 
> Granted there may be cases where exact fidelity is not possible, but
> that doesn't excuse eliminating fidelity where it does exist..
> 
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID. 
> > 
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
> 
> Maybe.. It is certainly a special case.
> 
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

True.  This does complicate how we represent what IOVA ranges are
valid, though.  I'll bet you most implementations don't actually
implement a full 64-bit IOVA, which means we effectively have a large
number of windows from (0..max IOVA) for each valid pasid.  This adds
another reason I don't think my concept of IOVA windows is just a
POWER-specific thing.

> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page
> table format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

Well, rather I think userspace needs to request what page table format
it wants and the kernel tells it whether it can oblige or not.

> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  4:14                               ` Alex Williamson
@ 2021-06-03  5:18                                 ` Tian, Kevin
  0 siblings, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  5:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, June 3, 2021 12:15 PM
> 
> On Thu, 3 Jun 2021 03:22:27 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, June 3, 2021 10:51 AM
> > >
> > > On Wed, 2 Jun 2021 19:45:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > > >
> > > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > > from the guest page table... what page table?
> > > >
> > > > I see my confusion now, the phrasing in your earlier remark led me to
> > > > think this was about allowing the no-snoop performance enhancement in
> > > > some restricted way.
> > > >
> > > > It is really about blocking no-snoop 100% of the time and then
> > > > disabling the dangerous wbinvd when the block is successful.
> > > >
> > > > Didn't closely read the kvm code :\
> > > >
> > > > If it was about allowing the optimization then I'd expect the guest to
> > > > enable no-snoopable regions via its vIOMMU and realize them to the
> > > > hypervisor and plumb the whole thing through. Hence my remark about
> > > > the guest page tables..
> > > >
> > > > So really the test is just 'were we able to block it' ?
> > >
> > > Yup.  Do we really still consider that there's some performance benefit
> > > to be had by enabling a device to use no-snoop?  This seems largely a
> > > legacy thing.
> >
> > Yes, there is indeed a performance benefit for a device to use no-snoop,
> > e.g. 8K display and some image processing paths, etc. The problem is
> > that the IOMMU for such devices is typically a different one from the
> > default IOMMU for most devices. This special IOMMU may not have
> > the ability to enforce snooping on no-snoop PCI traffic, so this fact
> > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> > for such devices to work correctly.
> 
> The case where the IOMMU does not support snoop-control for such a
> device already works fine, we can't prevent no-snoop so KVM will
> emulate wbinvd.  The harder one is if we should opt to allow no-snoop
> even if the IOMMU does support snoop-control.

In another discussion we are leaning toward a per-device capability
reporting scheme through /dev/ioasid (or /dev/iommu as the new
name). It seems natural to also allow setting a capability, e.g.
no-snoop, for a device if the underlying IOMMU driver allows it.
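
Roughly something like the below; the ioctl names, struct and the no-snoop
bit are all hypothetical here, just to make the shape of the interface
concrete:

	/* hypothetical per-device capability report + opt-in via /dev/iommu */
	struct iommu_device_caps {
		__u32	dev_id;		/* device label from bind time */
		__u32	caps;		/* filled by the kernel */
	#define IOMMU_DEV_CAP_NO_SNOOP	(1 << 0)
		__u32	enable;		/* user opt-in, honored only if the driver allows */
	};

	struct iommu_device_caps caps = { .dev_id = dev_id };

	ioctl(iommu_fd, IOMMU_DEVICE_GET_CAPS, &caps);		/* hypothetical */
	if (caps.caps & IOMMU_DEV_CAP_NO_SNOOP) {
		caps.enable = IOMMU_DEV_CAP_NO_SNOOP;
		ioctl(iommu_fd, IOMMU_DEVICE_SET_CAPS, &caps);	/* hypothetical */
	}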

> 
> > > > > This support existed before mdev, IIRC we needed it for direct
> > > > > assignment of NVIDIA GPUs.
> > > >
> > > > Probably because they ignored the disable no-snoop bits in the control
> > > > block, or reset them in some insane way to "fix" broken bioses and
> > > > kept using it even though by all rights qemu would have tried hard to
> > > > turn it off via the config space. Processing no-snoop without a
> > > > working wbinvd would be fatal. Yeesh
> > > >
> > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > > had..
> > > >
> > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > >    domains.
> > > >
> > > >    This doesn't actually matter. If you mix them together then kvm
> > > >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > >    anywhere in this VM.
> > > >
> > > >    Thus if two IOMMUs are joined together into a single /dev/ioasid
> > > >    then we can just make them both pretend to be
> > > >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> > >
> > > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > > need to emulate wbinvd, but no, we'll use IOMMU_CACHE any time it's
> > > available based on the per domain support available.  That gives us the
> > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > because they used to have a device attached where the domain required
> > > it and we can't atomically remap with new flags to perform the same as
> > > a VM that never had that device attached in the first place.
> > >
> > > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > > >
> > > >    What we want to do here is iterate over every ioasid associated
> > > >    with the group fd that is passed into kvm.
> > >
> > > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > > then attaching a device to an allocated ioasid (non-fd)... I assume
> > > you're talking about the latter ioasid.
> > >
> > > >    Today the group fd has a single container which specifies the
> > > >    single ioasid so this is being done trivially.
> > > >
> > > >    To reorg we want to get the ioasid from the device not the
> > > >    group (see my note to David about the groups vs device rational)
> > > >
> > > >    This is just iterating over each vfio_device in the group and
> > > >    querying the ioasid it is using.
> > >
> > > The IOMMU API group interface is largely iommu_group_for_each_dev()
> > > anyway, we still need to account for all the RIDs and aliases of a
> > > group.
> > >
> > > >    Or perhaps more directly: an op attaching the vfio_device to the
> > > >    kvm and having some simple helper
> > > >          '(un)register ioasid with kvm (kvm, ioasid)'
> > > >    that the vfio_device driver can call that just sorts this out.
> > >
> > > We could almost eliminate the device notion altogether here, use an
> > > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > > change to the composition of the device set for the ioasid, which is
> > > why we currently do it on addition or removal of a group, where the
> > > group has a consistent set of IOMMU properties.  Register a notifier
> > > callback via the ioasidfd?  Thanks,
> > >
> >
> > When discussing I/O page fault support in another thread, the consensus
> > is that a device handle will be registered (by the user) or allocated (returned
> > to the user) in /dev/ioasid when binding the device to the ioasid fd. From this
> > angle we can register {ioasid_fd, device_handle} to KVM and then call
> > something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
> 
> The mechanics are different, but this is pretty similar in concept to
> KVM learning coherence using the groupfd today.  Do we want to
> compromise on kernel control of wbinvd emulation to allow userspace to
> make such decisions?  Ownership of a device might be reason enough to
> allow the user that privilege.  Thanks,
> 

I think so. In the end it's still decided by the underlying IOMMU driver. 
If the IOMMU driver doesn't allow the user to opt for no-snoop, it's exactly
the same as today's groupfd approach. Otherwise a user-opted policy
implies that the decision is delegated to userspace.

Thanks
Kevin


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:37         ` Jason Gunthorpe
@ 2021-06-03  5:23           ` David Gibson
  2021-06-03 12:28             ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-03  5:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> 
> > I don't think presence or absence of a group fd makes a lot of
> > difference to this design.  Having a group fd just means we attach
> > groups to the ioasid instead of individual devices, and we no longer
> > need the bookkeeping of "partial" devices.
> 
> Oh, I think we really don't want to attach the group to an ioasid, or
> at least not as a first-class idea.
> 
> The fundamental problem that got us here is we now live in a world
> where there are many ways to attach a device to an IOASID:

I'm not seeing that that's necessarily a problem.

>  - A RID binding
>  - A RID,PASID binding
>  - A RID,PASID binding for ENQCMD

I have to admit I haven't fully grasped the differences between these
modes.  I'm hoping we can consolidate at least some of them into the
same sort of binding onto different IOASIDs (which may be linked in
parent/child relationships).

>  - A SW TABLE binding
>  - etc
> 
> The selection of which mode to use is based on the specific
> driver/device operation. Ie the thing that implements the 'struct
> vfio_device' is the thing that has to select the binding mode.

I thought userspace selected the binding mode - although not all modes
will be possible for all devices.

> group attachment was fine when there was only one mode. As you say it
> is fine to just attach every group member with RID binding if RID
> binding is the only option.
> 
> When SW TABLE binding was added the group code was hacked up - now the
> group logic is choosing between RID/SW TABLE in a very hacky and mdev
> specific way, and this is just a mess.

Sounds like it.  What do you propose instead to handle backwards
compatibility for group-based VFIO code?

> The flow must carry the IOASID from the /dev/iommu to the vfio_device
> driver and the vfio_device implementation must choose which binding
> mode and parameters it wants based on driver and HW configuration.
> 
> eg if two PCI devices are in a group then it is perfectly fine that
> one device uses RID binding and the other device uses RID,PASID
> binding.

Uhhhh... I don't see how that can be.  They could well be in the same
group because their RIDs cannot be distinguished from each other.

> The only place I see for a "group bind" in the uAPI is some compat
> layer for the vfio container, and the implementation would be quite
> different, we'd have to call each vfio_device driver in the group and
> execute the IOASID attach IOCTL.
> 
> > > I would say no on the container. /dev/ioasid == the container, having
> > > two competing objects at once in a single process is just a mess.
> > 
> > Right.  I'd assume that for compatibility, creating a container would
> > create a single IOASID under the hood with a compatibility layer
> > translating the container operations to ioasid operations.
> 
> It is a nice dream for sure
> 
> /dev/vfio could be a special case of /dev/ioasid just with a different
> uapi and ending up with only one IOASID. They could be interchangeable
> from then on, which would simplify the internals of VFIO if it
> consistently dealt with these new ioasid objects everywhere. But last I
> looked it was complicated enough to best be done later on
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 16:58     ` Jason Gunthorpe
  2021-06-03  2:49       ` Tian, Kevin
@ 2021-06-03  5:45       ` David Gibson
  2021-06-03 12:11         ` Jason Gunthorpe
  2021-06-04 10:24         ` Jean-Philippe Brucker
  1 sibling, 2 replies; 258+ messages in thread
From: David Gibson @ 2021-06-03  5:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > 	/* Bind guest I/O page table  */
> > > > 	bind_data = {
> > > > 		.ioasid	= gva_ioasid;
> > > > 		.addr	= gva_pgtable1;
> > > > 		// and format information
> > > > 	};
> > > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > 
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> > 
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually. 
> 
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

Right.. but I've not been convinced that having a /dev/iommu fd
instance be the boundary for these types of things actually makes
sense.  For example if we were doing the preregistration thing
(whether by child ASes or otherwise) then that still makes sense
across wildly different devices, but we couldn't share that layer if
we have to open different instances for each of them.

It really seems to me that it's at the granularity of the address
space (including extended RID+PASID ASes) that we need to know what
devices we have, and therefore what capabilities we have for that AS.

> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.
> 
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

Right, but at this stage I'm just not seeing a really clear (across
platforms and device types) boundary for what things have to be per
IOASID container and what have to be per IOASID, so I'm just not sure
the /dev/iommu instance grouping makes any sense.

> I don't know if that small advantage is worth the extra complexity
> though.
> 
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats.  Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept.  And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
> 
> The basic process for a user space owned pgtable mode would be:
> 
>  1) qemu has to figure out what format of pgtable to use
> 
>     Presumably it uses query functions using the device label.

No... in the qemu case it would always select the page table format
that it needs to present to the guest.  That's part of the
guest-visible platform that's selected by qemu's configuration.

There's no negotiation here: either the kernel can supply what qemu
needs to pass to the guest, or it can't.  If it can't, qemu will have
to either emulate in SW (if possible, probably using a kernel-managed
IOASID to back it) or fail outright.
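
So the qemu-side flow is roughly the below.  IOASID_ALLOC is from the
proposal; the alloc structure and format constant are assumptions used
only to illustrate "request exactly what the vIOMMU model needs":

	/* qemu asks for the format dictated by the guest-visible vIOMMU model */
	struct ioasid_alloc_data alloc_data = {		/* hypothetical struct */
		.format		= IOASID_FMT_VTD_STAGE1,	/* hypothetical constant */
		.addr_width	= 48,
	};

	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);
	if (gva_ioasid < 0) {
		/* kernel can't provide the guest-visible format: either
		 * shadow in SW via a kernel-managed IOASID, or fail to start */
	}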

>     The
>     kernel code should look at the entire device path through all the
>     IOMMU HW to determine what is possible.
> 
>     Or it already knows because the VM's vIOMMU is running in some
>     fixed page table format, or the VM's vIOMMU already told it, or
>     something.

Again, I think you have the order a bit backwards.  The user selects
the capabilities that the vIOMMU will present to the guest as part of
the qemu configuration.  Qemu then requests that of the host kernel,
and either the host kernel supplies it, qemu emulates it in SW, or
qemu fails to start.

Guest visible properties of the platform never (or *should* never)
depend implicitly on host capabilities - it's impossible to sanely
support migration in such an environment.

>  2) qemu creates an IOASID and based on #1 and says 'I want this format'

Right.

>  3) qemu binds the IOASID to the device. 
> 
> >     If qemu gets it wrong then it just fails.

Right, though it may fall back to (partial) software emulation.  In
practice that would mean using a kernel-managed IOASID and walking the
guest IO pagetables itself to mirror them into the host kernel.

>  4) For the next device qemu would have to figure out if it can re-use
> >     an existing IOASID based on the required properties.

Nope.  Again, what devices share an IO address space is a guest
visible part of the platform.  If the host kernel can't supply that,
then qemu must not start (or fail the hotplug if the new device is
being hotplugged).

> You pointed to the case of mixing vIOMMU's of different platforms. So
> it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
> page table mode v2" while running on an x86 because that is what the
> vIOMMU is wired to work with.

Yes.

> Presumably qemu will fall back to software emulation if this is not
> possible.

Right.  But even in this case it needs to do some checking of the
capabilities of the backing IOMMU.  At minimum the host IOMMU needs to
be able to map all the IOVAs that the guest expects to be mappable,
and the host IOMMU needs to support a pagesize that's a submultiple of
the pagesize expected in the guest.

For this reason, amongst some others, I think when selecting a kernel
managed pagetable we need to also have userspace explicitly request
which IOVA ranges are mappable, and what (minimum) page size it
needs.

> One interesting option for software emulation is to just transform the
> ARM page table format to a x86 page table format in userspace and use
> nested bind/invalidate to synchronize with the kernel. With SW nesting
> I suspect this would be much faster

It should be possible *if* the backing IOMMU can support the necessary
IOVAs and pagesizes (and maybe some other things I haven't thought
of).  If not, you're simply out of luck and there's no option but to
fail to start the guest.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  2:49       ` Tian, Kevin
@ 2021-06-03  5:48         ` David Gibson
  0 siblings, 0 replies; 258+ messages in thread
From: David Gibson @ 2021-06-03  5:48 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Thu, Jun 03, 2021 at 02:49:56AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, June 3, 2021 12:59 AM
> > 
> > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > 	/* Bind guest I/O page table  */
> > > > > 	bind_data = {
> > > > > 		.ioasid	= gva_ioasid;
> > > > > 		.addr	= gva_pgtable1;
> > > > > 		// and format information
> > > > > 	};
> > > > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > >
> > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > there any reason to split these things? The only advantage to the
> > > > split is the device is known, but the device shouldn't impact
> > > > anything..
> > >
> > > I'm pretty sure the device(s) could matter, although they probably
> > > won't usually.
> > 
> > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > devices first. This prevents wildly incompatible devices from being
> > joined together, and allows some "get info" to report the capability
> > union of all devices if we want to do that.
> 
> I would expect the capability to be reported per-device via /dev/iommu.
> Incompatible devices can bind to the same fd but cannot attach to
> the same IOASID. This allows incompatible devices to share locked
> page accounting.

Yeah... I'm not convinced that everything relevant here can be
reported per-device.  I think we may have edge cases where
combinations of devices have restrictions that individual devices in
the set do not.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 23:23         ` Jason Gunthorpe
@ 2021-06-03  5:49           ` Lu Baolu
  0 siblings, 0 replies; 258+ messages in thread
From: Lu Baolu @ 2021-06-03  5:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Tian, Kevin, LKML, Joerg Roedel, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On 6/3/21 7:23 AM, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:
>> On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
>>>
>>>> This version only covers 1) and 4). Do you think we need to support 2),
>>>> 3) and beyond?
>>>
> >>> Yes absolutely. The API should be flexible enough to specify the
>>> creation of all future page table formats we'd want to have and all HW
>>> specific details on those formats.
>>
>> OK, stay in the same line.
>>
>>>> If so, it seems that we need some in-kernel helpers and uAPIs to
>>>> support pre-installing a page table to IOASID.
>>>
>>> Not sure what this means..
>>
>> Sorry that I didn't make this clear.
>>
>> Let me bring back the page table types in my eyes.
>>
>>   1) IOMMU format page table (a.k.a. iommu_domain)
>>   2) user application CPU page table (SVA for example)
>>   3) KVM EPT (future option)
>>   4) VM guest managed page table (nesting mode)
>>
>> Each type of page table should be able to be associated with its IOASID.
>> We have BIND protocol for 4); We explicitly allocate an iommu_domain for
>> 1). But we don't have a clear definition for 2) 3) and others. I think
>> it's necessary to clearly define a time point and kAPI name between
>> IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
>> opportunity to associate their page table with the allocated IOASID
>> before attaching the page table to the real IOMMU hardware.
> 
> In my mind these are all actions of creation..
> 
> #1 is ALLOC_IOASID 'to be compatible with these devices attached to
>     this FD'
> #2 is ALLOC_IOASID_SVA
> #3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
> #4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
>        ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address
> 
> Each allocation should have a set of operations that are allowed:
> map/unmap is only legal on #1, invalidate is only legal on #4, etc.

This sounds reasonable. The corresponding page table types and required
callbacks are also part of it.
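
A sketch of how the per-type operation check might look on the kernel
side; the type names mirror the list above, the command names reuse
IOASID_MAP_DMA/IOASID_UNMAP_DMA from the proposal, and
IOASID_INVALIDATE_CACHE is an assumed name for the invalidation command:

	enum ioasid_type {
		IOASID_TYPE_KERNEL_PGTABLE,	/* 1) iommu_domain, map/unmap */
		IOASID_TYPE_SVA,		/* 2) shares a CPU page table */
		IOASID_TYPE_KVM_EPT,		/* 3) shares the KVM EPT */
		IOASID_TYPE_USER_PGTABLE,	/* 4) guest-managed, bind/invalidate */
	};

	static bool ioasid_cmd_allowed(enum ioasid_type type, unsigned int cmd)
	{
		switch (cmd) {
		case IOASID_MAP_DMA:
		case IOASID_UNMAP_DMA:
			return type == IOASID_TYPE_KERNEL_PGTABLE;
		case IOASID_INVALIDATE_CACHE:		/* assumed name */
			return type == IOASID_TYPE_USER_PGTABLE;
		default:
			return false;
		}
	}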

> 
> How you want to split this up in the ioctl interface is a more
> interesting question. I generally like more calls than giant unwieldly
> multiplexer structs, but some things are naturally flags and optional
> modifications of a single ioctl.
> 
> In any event they should have a similar naming 'ALLOC_IOASID_XXX' and
> then a single 'DESTROY_IOASID' that works on all of them.
> 
>> I/O page fault handling is similar. The provider of the page table
>> should take the responsibility to handle the possible page faults.
> 
> For the faultable types, yes #3 and #4 should hook in the fault
> handler and deal with it.

Agreed.

Best regards,
baolu


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 11:09   ` Lu Baolu
  2021-06-01 17:26     ` Jason Gunthorpe
@ 2021-06-03  5:54     ` David Gibson
  2021-06-03  6:50       ` Lu Baolu
  1 sibling, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-03  5:54 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Jason Gunthorpe, Tian, Kevin, LKML, Joerg Roedel,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> Hi Jason,
> 
> On 2021/5/29 7:36, Jason Gunthorpe wrote:
> > > /*
> > >    * Bind an user-managed I/O page table with the IOMMU
> > >    *
> > >    * Because user page table is untrusted, IOASID nesting must be enabled
> > >    * for this ioasid so the kernel can enforce its DMA isolation policy
> > >    * through the parent ioasid.
> > >    *
> > >    * Pgtable binding protocol is different from DMA mapping. The latter
> > >    * has the I/O page table constructed by the kernel and updated
> > >    * according to user MAP/UNMAP commands. With pgtable binding the
> > >    * whole page table is created and updated by userspace, thus different
> > >    * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> > >    *
> > >    * Because the page table is directly walked by the IOMMU, the user
> > >    * must  use a format compatible to the underlying hardware. It can
> > >    * check the format information through IOASID_GET_INFO.
> > >    *
> > >    * The page table is bound to the IOMMU according to the routing
> > >    * information of each attached device under the specified IOASID. The
> > >    * routing information (RID and optional PASID) is registered when a
> > >    * device is attached to this IOASID through VFIO uAPI.
> > >    *
> > >    * Input parameters:
> > >    *	- child_ioasid;
> > >    *	- address of the user page table;
> > >    *	- formats (vendor, address_width, etc.);
> > >    *
> > >    * Return: 0 on success, -errno on failure.
> > >    */
> > > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> > Also feels backwards, why wouldn't we specify this, and the required
> > page table format, during alloc time?
> > 
> 
> Thinking of the required page table format, perhaps we should shed more
> light on the page table of an IOASID. So far, an IOASID might represent
> one of the following page tables (might be more):
> 
>  1) an IOMMU format page table (a.k.a. iommu_domain)
>  2) a user application CPU page table (SVA for example)
>  3) a KVM EPT (future option)
>  4) a VM guest managed page table (nesting mode)
> 
> This version only covers 1) and 4). Do you think we need to support 2),

Isn't (2) the equivalent of using the host-managed pagetable
then doing a giant MAP of all your user address space into it?  But
maybe we should identify that case explicitly in case the host can
optimize it.

> 3) and beyond? If so, it seems that we need some in-kernel helpers and
> uAPIs to support pre-installing a page table to IOASID. From this point
> of view an IOASID is actually not just a variant of iommu_domain, but an
> I/O page table representation in a broader sense.
> 
> Best regards,
> baolu
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 17:19   ` Jason Gunthorpe
  2021-06-03  3:02     ` Tian, Kevin
@ 2021-06-03  6:26     ` David Gibson
  2021-06-03 12:46       ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: David Gibson @ 2021-06-03  6:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy


On Wed, Jun 02, 2021 at 02:19:30PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:
> 
> > Is there a compelling reason to have all the IOASIDs handled by one
> > FD?
> 
> There was an answer on this: if every PASID needs an IOASID then there
> are too many FDs.

Too many in what regard?  fd limits?  Something else?

It seems to be there are two different cases for PASID handling here.
One is where userspace explicitly creates each valid PASID and
attaches a separate pagetable for each (or handles each with
MAP/UNMAP).  In that case I wouldn't have expected there to be too
many fds.

Then there's the case where we register a whole PASID table, in which
case I think you only need the one FD.  We can treat that as creating
an 84-bit IOAS, whose pagetable format is (PASID table + a bunch of
pagetables for each PASID).

> It is difficult to share the get_user_pages cache across FDs.

Ah... hrm, yes I can see that.

> There are global properties in the /dev/iommu FD, like what devices
> are part of it, that are important for group security operations. This
> becomes confused if it is split to many FDs.

I'm still not seeing those.  I'm really not seeing any well-defined
meaning to devices being attached to the fd, but not to a particular
IOAS.

> > > I/O address space can be managed through two protocols, according to 
> > > whether the corresponding I/O page table is constructed by the kernel or 
> > > the user. When kernel-managed, a dma mapping protocol (similar to 
> > > existing VFIO iommu type1) is provided for the user to explicitly specify 
> > > how the I/O address space is mapped. Otherwise, a different protocol is 
> > > provided for the user to bind an user-managed I/O page table to the 
> > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> > > handling. 
> > > 
> > > Pgtable binding protocol can be used only on the child IOASID's, implying 
> > > IOASID nesting must be enabled. This is because the kernel doesn't trust 
> > > userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> > > through the parent IOASID.
> > 
> > To clarify, I'm guessing that's a restriction of likely practice,
> > rather than a fundamental API restriction.  I can see a couple of
> > theoretical future cases where a user-managed pagetable for a "base"
> > IOASID would be feasible:
> > 
> >   1) On some fancy future MMU allowing free nesting, where the kernel
> >      would insert an implicit extra layer translating user addresses
> >      to physical addresses, and the userspace manages a pagetable with
> >      its own VAs being the target AS
> 
> I would model this by having an "SVA" parent IOASID. An "SVA" IOASID is one
> where the IOVA == process VA and the kernel maintains this mapping.

That makes sense.  It needs a different name to avoid Intel and PCI
specificity, but having a trivial "pagetable format" which just says
IOVA == user address is a nice idea.
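
i.e. a page table "format" that simply declares identity with the process
VA space, something like the below (struct and format name are made up
here, reusing the hypothetical alloc structure from earlier in the thread):

	/* hypothetical format value: IOVA == process VA, kernel keeps it in sync */
	struct ioasid_alloc_data alloc_data = {
		.format	= IOASID_FMT_PROCESS_VA,
	};

	sva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);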

> Since the uAPI is so general I do have a general expectation that the
> drivers/iommu implementations might need to be a bit more complicated,
> like if the HW can optimize certain specific graphs of IOASIDs we
> would still model them as graphs and the HW driver would have to
> "compile" the graph into the optimal hardware.
> 
> This approach has worked reasonable in other kernel areas.

That seems sensible.

> >   2) For a purely software virtual device, where its virtual DMA
> >      engine can interpret user addresses fine
> 
> This also sounds like an SVA IOASID.

Ok.

> Depending on HW if a device can really only bind to a very narrow kind
> of IOASID then it should ask for that (probably platform specific!)
> type during its attachment request to drivers/iommu.
> 
> eg "I am special hardware and only know how to do PLATFORM_BLAH
> transactions, give me an IOASID compatible with that". If the only way
> to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
> hardwired to the CPU ASID  then that is just how it is.

Fair enough.

> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations.  e.g.
> > 
> > 	'prereg' IOAS
> > 	|
> > 	\- 'rid' IOAS
> > 	   |
> > 	   \- 'pasid' IOAS (maybe)
> > 
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> > 
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices.  But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
> > would be an alternative to attaching devices.
> 
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation-driven, as the data structure to store the
> preregistration data should be rather purpose-built.

Right.  I think this gets nicer now that we're considering more
specific options at IOAS creation time, and different "types" of
IOAS.  We could add a "preregistration" IOAS type, which supports
MAP/UNMAP of user addresses, and allows *no* devices to be attached,
but does allow other IOAS types to be added as children.
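
The flow would then look something like the below.  IOASID_ALLOC,
IOASID_MAP_DMA and the parent/child relation come from the proposal; the
'type' and 'parent' fields and the map_guest_ram argument are assumptions
for illustration:

	/* 'prereg' IOAS: pinning/accounting only, no devices may attach */
	struct ioasid_alloc_data prereg_alloc = {
		.type	= IOASID_TYPE_PREREG,		/* hypothetical */
	};
	prereg_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &prereg_alloc);
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map_guest_ram);	/* user VA -> "GPA" */

	/* 'rid' IOAS: child of prereg, devices attach here */
	struct ioasid_alloc_data gpa_alloc = {
		.parent	= prereg_ioasid,		/* hypothetical field */
	};
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &gpa_alloc);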

> 
> > > /*
> > >   * Map/unmap process virtual addresses to I/O virtual addresses.
> > >   *
> > >   * Provide VFIO type1 equivalent semantics. Start with the same 
> > >   * restriction e.g. the unmap size should match those used in the 
> > >   * original mapping call. 
> > >   *
> > >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >   * must be already in the preregistered list.
> > >   *
> > >   * Input parameters:
> > >   *	- u32 ioasid;
> > >   *	- refer to vfio_iommu_type1_dma_{un}map
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> > 
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
> 
> Me too, or a SVA page table.
> 
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..

Right.  With the possible exception that I can imagine a call for
several types which all support MAP/UNMAP, but have other different
characteristics.

> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too
> 
> > > 5. Use Cases and Flows
> > > 
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o 
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > > 
> > > 	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > 
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
> > for mdevs.  That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
> 
> There are a bunch of nice options here if we go this path
> 
> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++
> > > 
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> > 
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.
> 
> I think attachment should always be explicit in the API. If the user
> doesn't explicitly ask for a device to be attached to the IOASID then
> the iommu driver is free to block it.
> 
> If you want passthrough then you have to create a passthrough IOASID
> and attach every device to it. Some of those attaches might be NOP's
> due to groups.

Agreed.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* RE: [RFC] /dev/ioasid uAPI proposal
  2021-05-28 23:36 ` Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2021-06-02  7:22   ` David Gibson
@ 2021-06-03  6:39   ` Tian, Kevin
  2021-06-03 13:05     ` Jason Gunthorpe
  3 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  6:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 29, 2021 7:37 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> >
> > /*
> >   * Check whether an uAPI extension is supported.
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration.
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
> 
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   *	- vaddr;
> >   *	- size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE,
> IOASID_BASE + 2)
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?

yes.

> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all-SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then

Agree. Regarding the uAPI there is no difference between a SW IOASID and
a HW IOASID. The main difference is behind /dev/ioasid: a SW IOASID
is not linked to the IOMMU.

> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
> 
> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID.
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address
> >   * space. Each IOASID is associated with a single I/O page table. User
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is
> >   * attached to a device. Once attached, an empty I/O page table is
> >   * bound with the IOMMU then the user could use either DMA mapping
> >   * or pgtable binding commands to manage this I/O page table.
> 
> Can the IOASID can be populated before being attached?
> 
> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?

I'll skip the /dev/ioasid uAPI comments below about alloc/bind. They're
already covered in other sub-threads.

[...]
 
> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?

Exactly

> 
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *  - ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

As discussed earlier, either an input or output "device id" is fine here.

> 
> >
> > 2.3. KVM uAPI
> > ++++++++++++
> >
> > /*
> >   * Update CPU PASID mapping
> >   *
> >   * This is necessary when ENQCMD will be used in the guest while the
> >   * targeted device doesn't accept the vPASID saved in the CPU MSR.
> >   *
> >   * This command allows user to set/clear the vPASID->pPASID mapping
> >   * in the CPU, by providing the IOASID (and FD) information representing
> >   * the I/O address space marked by this vPASID.
> >   *
> >   * Input parameters:
> >   *	- user_pasid;
> >   *	- ioasid_fd;
> >   *	- ioasid;
> >   */
> > #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> > #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
> 
> It seems simple enough.. So the physical PASID can only be assigned if
> the user has an IOASID that points at it? Thus it is secure?

Yes. The kernel doesn't trust the user to provide a random physical PASID.

> 
> > 3. Sample structures and helper functions
> >
> > Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> >
> > 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> > 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev
> *dev);
> > 	int ioasid_unregister_device(struct ioasid_dev *dev);
> >
> > An ioasid_ctx is created for each fd:
> >
> > 	struct ioasid_ctx {
> > 		// a list of allocated IOASID data's
> > 		struct list_head		ioasid_list;
> 
> Would expect an xarray
> 
> > 		// a list of registered devices
> > 		struct list_head		dev_list;
> 
> xarray of device_id

list of ioasid_dev objects. device_id will be put inside each object.
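
With the xarray suggestion folded in, the context might look roughly like
this (purely illustrative):

	struct ioasid_ctx {
		/* allocated ioasid_data objects, indexed by IOASID number */
		struct xarray		ioasid_xa;
		/* registered ioasid_dev objects, indexed by device_id */
		struct xarray		dev_xa;
		/* pre-registered VA ranges, kept as an interval tree */
		struct rb_root_cached	prereg_itree;
	};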

> 
> > 		// a list of pre-registered virtual address ranges
> > 		struct list_head		prereg_list;
> 
> Should re-use the existing SW IOASID table, and be an interval tree.

What is the existing SW IOASID table?

> 
> > Each registered device is represented by ioasid_dev:
> >
> > 	struct ioasid_dev {
> > 		struct list_head		next;
> > 		struct ioasid_ctx	*ctx;
> > 		// always be the physical device
> > 		struct device 		*device;
> > 		struct kref		kref;
> > 	};
> >
> > Because we assume one vfio_device connected to at most one ioasid_fd,
> > here ioasid_dev could be embedded in vfio_device and then linked to
> > ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> > device should be the pointer to the parent device. PASID marking this
> > mdev is specified later when VFIO_ATTACH_IOASID.
> 
> Don't embed a struct like this in something with vfio_device - that
> just makes a mess of reference counting by having multiple krefs in
> the same memory block. Keep it as a pointer, the attach operation
> should return a pointer to the above struct.

OK. Also, given the agreement that one device can bind to multiple
fd's, this struct-embedding approach doesn't work anyway.

> 
> > An ioasid_data is created when IOASID_ALLOC, as the main object
> > describing characteristics about an I/O page table:
> >
> > 	struct ioasid_data {
> > 		// link to ioasid_ctx->ioasid_list
> > 		struct list_head		next;
> >
> > 		// the IOASID number
> > 		u32			ioasid;
> >
> > 		// the handle to convey iommu operations
> > 		// hold the pgd (TBD until discussing iommu api)
> > 		struct iommu_domain *domain;
> 
> But at least for the first coding draft I would expect to see this API
> presented with no PASID support and a simple 1:1 with iommu_domain.
> How PASID gets modeled is the big TBD, right?

Yes. As the starting point we will assume a 1:1 association. This should
work for PF/VF, but mdev must be considered very soon. I expect
we can start the conversation on PASID support once this uAPI proposal
is settled.

> 
> > ioasid_data and iommu_domain have overlapping roles as both are
> > introduced to represent an I/O address space. It is still a big TBD how
> > the two should be corelated or even merged, and whether new iommu
> > ops are required to handle RID+PASID explicitly.
> 
> I think it is OK that the uapi and kernel api have different
> structs. The uapi focused one should hold the uapi related data, which
> is what you've shown here, I think.
> 
> > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> >
> > 	struct attach_info {
> > 		u32	ioasid;
> > 		// If valid, the PASID to be used physically
> > 		u32	pasid;
> > 	};
> > 	int ioasid_device_attach(struct ioasid_dev *dev,
> > 		struct attach_info info);
> > 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> 
> Honestly, I still prefer this to be highly explicit as this is where
> all device driver authors get involved:
> 
> ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> u32 ioasid);
> ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid,
> struct ioasid_dev *dev, u32 ioasid);

Then it's better to name it pci_device_attach_ioasid, since the 1st parameter
is a struct pci_device?

By keeping physical_pasid as a pointer, you want to remove the last helper
function (ioasid_get_global_pasid) so the global pasid is returned along
with the attach function?
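
i.e. collapsing the two into a single call, roughly like the below (a
sketch of the signature under discussion, not a settled API):

	/*
	 * Attach a PCI device to 'ioasid' with PASID granularity.  On success
	 * *physical_pasid returns the PASID actually used on the wire
	 * (allocated from the global space when sharing is required), so a
	 * separate ioasid_get_global_pasid() helper is no longer needed.
	 */
	int pci_device_attach_ioasid_pasid(struct pci_device *pdev,
					   struct ioasid_dev *dev, u32 ioasid,
					   u32 *physical_pasid);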

> 
> And presumably a variant for ARM non-PCI platform (?) devices.
> 
> This could boil down to a __ioasid_device_attach() as you've shown.
> 
> > A new object is introduced and linked to ioasid_data->attach_data for
> > each successful attach operation:
> >
> > 	struct ioasid_attach_data {
> > 		struct list_head		next;
> > 		struct ioasid_dev	*dev;
> > 		u32 			pasid;
> > 	}
> 
> This should be returned as a pointer and detatch should be:
> 
> int ioasid_device_detach(struct ioasid_attach_data *);

ok

> 
> > As explained in the design section, there is no explicit group enforcement
> > in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> > implicit group check - before every device within an iommu group is
> > attached to this IOASID, the previously-attached devices in this group are
> > put in ioasid_data->partial_devices. The IOASID rejects any command if
> > the partial_devices list is not empty.
> 
> It is simple enough. Would be good to design in a diagnostic string so
> userspace can make sense of the failure. Eg return something like
> -EDEADLK and provide an ioctl 'why did EDEADLK happen' ?
> 

Make sense.

> 
> > Then is the last helper function:
> > 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> > 		u32 ioasid, bool alloc);
> >
> > ioasid_get_global_pasid is necessary in scenarios where multiple devices
> > want to share a same PASID value on the attached I/O page table (e.g.
> > when ENQCMD is enabled, as explained in next section). We need a
> > centralized place (ioasid_data->pasid) to hold this value (allocated when
> > first called with alloc=true). vfio device driver calls this function (alloc=
> > true) to get the global PASID for an ioasid before calling ioasid_device_
> > attach. KVM also calls this function (alloc=false) to setup PASID translation
> > structure when user calls KVM_MAP_PASID.
> 
> When/why would the VFIO driver do this? Isn't this just some variant
> of pasid_attach?
> 
> ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32
> *physical_pasid, struct ioasid_dev *dev, u32 ioasid);
> 
> ?

will adopt this way.

> 
> > 4. PASID Virtualization
> >
> > When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> > created on the assigned vfio device. This leads to the concepts of
> > "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> > by the guest to mark an GVA address space while pPASID is the one
> > selected by the host and actually routed in the wire.
> >
> > vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
> 
> Should the vPASID programmed into the IOASID before calling
> VFIO_ATTACH_IOASID?

No. As explained in an earlier reply, when multiple devices are attached
to the same IOASID the guest may link the page table to a different
vPASID# on each attached device. In any case, vPASID is a per-RID thing.

> 
> > vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> > device, with two factors to be considered:
> >
> > -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> >      should be instead converted to a newly-allocated one (vPASID!=
> >      pPASID);
> >
> > -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> >      space or a global PASID space (implying sharing pPASID cross devices,
> >      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> >      as part of the process context);
> 
> This whole section 4 is really confusing. I think it would be more
> understandable to focus on the list below and minimize the vPASID
> 
> > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > supported. There are three possible scenarios:
> >
> > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > policies.)
> 
> This has become unclear. I think this should start by identifying the
> 6 main type of devices and how they can use pPASID/vPASID:
> 
> 0) Device is a RID and cannot issue PASID
> 1) Device is a mdev and cannot issue PASID
> 2) Device is a mdev and programs a single fixed PASID during bind,
>    does not accept PASID from the guest

There is no vPASID per se in the above 3 types, so this section only
focuses on the latter 3 types. But I can include them in the next
version if it sets the tone more clearly.

> 
> 3) Device accepts any PASIDs from the guest. No
>    vPASID/pPASID translation is possible. (classic vfio_pci)
> 4) Device accepts any PASID from the guest and has an
>    internal vPASID/pPASID translation (enhanced vfio_pci)

what is enhanced vfio_pci? In my writing this is for mdev
which doesn't support ENQCMD

> 5) Device accepts and PASID from the guest and relys on
>    external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)
> 
> 0-2 don't use vPASID at all
> 
> 3-5 consume a vPASID but handle it differently.
> 
> I think the 3-5 map into what you are trying to explain in the table
> below, which is the rules for allocating the vPASID depending on which
> of device types 3-5 are present and or mixed.

Exactly

> 
> For instance device type 3 requires vPASID == pPASID because it can't
> do translation at all.
> 
> This probably all needs to come through clearly in the /dev/ioasid
> interface. Once the attached devices are labled it would make sense to
> have a 'query device' /dev/ioasid IOCTL to report the details based on
> how the device attached and other information.

This is a good point. Another benefit of having a device label.

For 0-2 the device will report no PASID support. Although this may
duplicate other information (e.g. the PCI PASID cap), it provides a
vendor-agnostic way of reporting details around the IOASID.

For 3-5 the device will report PASID support. In these cases the user
is expected to always provide a vPASID.

For 5, in addition, the device will report a requirement for CPU PASID
translation. For such a device the user should talk to KVM to set up
the PASID mapping. This way the user doesn't need to know whether a
device is a pdev or an mdev; it just follows what the device capability
reports.
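
Just to illustrate the idea (none of this is in the proposal yet; the
struct, flag names and layout are made up here), the per-device report
could be as simple as:

	/* hypothetical 'query device' info, for illustration only */
	struct ioasid_device_info {
		__u32	argsz;
		__u32	flags;
	#define IOASID_DEV_INFO_PASID		(1 << 0) /* types 3-5 */
	#define IOASID_DEV_INFO_PASID_CPU_XLATE	(1 << 1) /* type 5 only */
		__u32	dev_label;	/* label registered at bind time */
		__u32	pasid_width;	/* valid if the PASID flag is set */
	};

Types 0-2 would report neither flag, 3-5 report PASID, and 5 in
addition reports the CPU translation requirement so the user knows to
go to KVM.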

> 
> > 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> >
> >      PASIDs are also used by kernel to mark the default I/O address space
> >      for mdev, thus cannot be delegated to the guest. Instead, the mdev
> >      driver must allocate a new pPASID for each vPASID (thus vPASID!=
> >      pPASID) and then use pPASID when attaching this mdev to an ioasid.
> 
> I don't understand this at all.. What does "PASIDs are also used by
> the kernel" mean?

This just refers to your type 2. Because PASIDs on this device are
already used by the parent driver to mark mdevs, we cannot delegate the
per-RID PASID space to the guest.

> 
> >      The mdev driver needs cache the PASID mapping so in mediation
> >      path vPASID programmed by the guest can be converted to pPASID
> >      before updating the physical MMIO register.
> 
> This is my scenario #4 above. Device and internally virtualize
> vPASID/pPASID - how that is done is up to the device. But this is all
> just labels, when such a device attaches, it should use some specific
> API:
> 
> ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
>  u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

yes.
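
For example, a minimal sketch of the mdev driver side, assuming an
xarray keyed by vPASID (the attach helper is the one you suggested
above; the my_mdev fields and the rest are illustrative only):

	/* illustrative only */
	static int my_mdev_attach_vpasid(struct my_mdev *m, u32 ioasid,
					 u32 vpasid)
	{
		u32 ppasid;
		int ret;

		ret = ioasid_pci_device_vpasid_attach(m->parent_pdev,
						      &ppasid, &vpasid,
						      m->idev, ioasid);
		if (ret)
			return ret;

		/* remember the translation for the mediation path */
		return xa_err(xa_store(&m->pasid_xlate, vpasid,
				       xa_mk_value(ppasid), GFP_KERNEL));
	}

	/* mediation path: guest writes a vPASID to a trapped MMIO reg */
	static u32 my_mdev_xlate_pasid(struct my_mdev *m, u32 vpasid)
	{
		void *entry = xa_load(&m->pasid_xlate, vpasid);

		return entry ? (u32)xa_to_value(entry) : (u32)-1;
	}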

> 
> And then maintain its internal translation
> 
> >      In previous thread a PASID range split scheme was discussed to support
> >      this combination, but we haven't worked out a clean uAPI design yet.
> >      Therefore in this proposal we decide to not support it, implying the
> >      user should have some intelligence to avoid such scenario. It could be
> >      a TODO task for future.
> 
> It really just boils down to how to allocate the PASIDs to get around
> the bad viommu interface that assumes all PASIDs are usable by all
> devices.

The viommu (e.g. Intel VT-d) has a good interface to restrict how many
PASIDs are available to the guest: there is a PASID size field in the
viommu register. Here the puzzle is just about how to design a good
uAPI to handle this mixed scenario where vPASID/pPASID come from split
ranges yet must be linked to the same I/O page table.

I'll see whether this can be accommodated after addressing the other
comments in this section.

> 
> > In spite of those subtle considerations, the kernel implementation could
> > start simple, e.g.:
> >
> > -    v==p for pdev;
> > -    v!=p and always use a global PASID pool for all mdev's;
> 
> Regardless all this mess needs to be hidden from the consuming drivers
> with some simple APIs as above. The driver should indicate what its HW
> can do and the PASID #'s that magically come out of /dev/ioasid should
> be appropriate.
> 

Yes, I see how it should work now.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  5:09             ` David Gibson
@ 2021-06-03  6:49               ` Tian, Kevin
  2021-06-03 11:47                 ` Jason Gunthorpe
  2021-06-08  0:49                 ` David Gibson
  0 siblings, 2 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  6:49 UTC (permalink / raw)
  To: David Gibson
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Jason Gunthorpe, Robin Murphy

> From: David Gibson
> Sent: Thursday, June 3, 2021 1:09 PM
[...]
> > > In this way the SW mode is the same as a HW mode with an infinite
> > > cache.
> > >
> > > The collaposed shadow page table is really just a cache.
> > >
> >
> > OK. One additional thing is that we may need a 'caching_mode"
> > thing reported by /dev/ioasid, indicating whether invalidation is
> > required when changing non-present to present. For hardware
> > nesting it's not reported as the hardware IOMMU will walk the
> > guest page table in cases of iotlb miss. For software nesting
> > caching_mode is reported so the user must issue invalidation
> > upon any change in guest page table so the kernel can update
> > the shadow page table timely.
> 
> For the fist cut, I'd have the API assume that invalidates are
> *always* required.  Some bypass to avoid them in cases where they're
> not needed can be an additional extension.
> 

Isn't the typical TLB semantic that non-present entries are not
cached, thus invalidation is not required when changing an entry from
non-present to present? That's true for both the CPU TLB and the
IOMMU TLB. In reality I feel there are more usages built on hardware
nesting than software nesting, so making the default follow the
hardware TLB behavior makes more sense...
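
If we do follow the hardware TLB behavior as the default, the
software-nesting exception could be surfaced with a single capability
bit, e.g. (naming purely illustrative, not part of the proposal):

	/* hypothetical flag in the IOASID_GET_INFO output */
	#define IOASID_INFO_FLAG_CACHING_MODE	(1 << 0)
	/*
	 * Set for software nesting: a non-present -> present change in
	 * the user/guest page table must be followed by an explicit
	 * iotlb invalidation so the kernel can update the shadow page
	 * table. Clear for hardware nesting: like a real TLB, only
	 * present -> non-present changes and permission downgrades need
	 * invalidation.
	 */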

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  5:54     ` David Gibson
@ 2021-06-03  6:50       ` Lu Baolu
  2021-06-03 12:56         ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Lu Baolu @ 2021-06-03  6:50 UTC (permalink / raw)
  To: David Gibson
  Cc: baolu.lu, Jason Gunthorpe, Tian, Kevin, LKML, Joerg Roedel,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

Hi David,

On 6/3/21 1:54 PM, David Gibson wrote:
> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
>> Hi Jason,
>>
>> On 2021/5/29 7:36, Jason Gunthorpe wrote:
>>>> /*
>>>>     * Bind an user-managed I/O page table with the IOMMU
>>>>     *
>>>>     * Because user page table is untrusted, IOASID nesting must be enabled
>>>>     * for this ioasid so the kernel can enforce its DMA isolation policy
>>>>     * through the parent ioasid.
>>>>     *
>>>>     * Pgtable binding protocol is different from DMA mapping. The latter
>>>>     * has the I/O page table constructed by the kernel and updated
>>>>     * according to user MAP/UNMAP commands. With pgtable binding the
>>>>     * whole page table is created and updated by userspace, thus different
>>>>     * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>>>>     *
>>>>     * Because the page table is directly walked by the IOMMU, the user
>>>>     * must  use a format compatible to the underlying hardware. It can
>>>>     * check the format information through IOASID_GET_INFO.
>>>>     *
>>>>     * The page table is bound to the IOMMU according to the routing
>>>>     * information of each attached device under the specified IOASID. The
>>>>     * routing information (RID and optional PASID) is registered when a
>>>>     * device is attached to this IOASID through VFIO uAPI.
>>>>     *
>>>>     * Input parameters:
>>>>     *	- child_ioasid;
>>>>     *	- address of the user page table;
>>>>     *	- formats (vendor, address_width, etc.);
>>>>     *
>>>>     * Return: 0 on success, -errno on failure.
>>>>     */
>>>> #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
>>>> #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
>>> Also feels backwards, why wouldn't we specify this, and the required
>>> page table format, during alloc time?
>>>
>> Thinking of the required page table format, perhaps we should shed more
>> light on the page table of an IOASID. So far, an IOASID might represent
>> one of the following page tables (might be more):
>>
>>   1) an IOMMU format page table (a.k.a. iommu_domain)
>>   2) a user application CPU page table (SVA for example)
>>   3) a KVM EPT (future option)
>>   4) a VM guest managed page table (nesting mode)
>>
>> This version only covers 1) and 4). Do you think we need to support 2),
> Isn't (2) the equivalent of using the using the host-managed pagetable
> then doing a giant MAP of all your user address space into it?  But
> maybe we should identify that case explicitly in case the host can
> optimize it.
> 

Conceptually, yes. Current SVA implementation just reuses the
application's cpu page table w/o map/unmap operations.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  6:15 ` David Gibson
  2021-06-02 17:19   ` Jason Gunthorpe
@ 2021-06-03  7:17   ` Tian, Kevin
  2021-06-03 12:49     ` Jason Gunthorpe
  2021-06-08  5:49     ` David Gibson
  2021-06-03  8:12   ` Tian, Kevin
  2 siblings, 2 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  7:17 UTC (permalink / raw)
  To: David Gibson
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, June 2, 2021 2:15 PM
> 
[...] 
> > An I/O address space takes effect in the IOMMU only after it is attached
> > to a device. The device in the /dev/ioasid context always refers to a
> > physical one or 'pdev' (PF or VF).
> 
> What you mean by "physical" device here isn't really clear - VFs
> aren't really physical devices, and the PF/VF terminology also doesn't
> extent to non-PCI devices (which I think we want to consider for the
> API, even if we're not implemenenting it any time soon).

Yes, it's not very clear, and stated more in a PCI context to simplify
the description. A "physical" one here means a PCI endpoint function
which has a unique RID. It's more to differentiate it from the later
mdev/subdevice which uses both RID+PASID. Naming is always a hard
exercise for me... Possibly I'll just use device vs. subdevice in
future versions.

> 
> Now, it's clear that we can't program things into the IOMMU before
> attaching a device - we might not even know which IOMMU to use.

yes

> However, I'm not sure if its wise to automatically make the AS "real"
> as soon as we attach a device:
> 
>  * If we're going to attach a whole bunch of devices, could we (for at
>    least some IOMMU models) end up doing a lot of work which then has
>    to be re-done for each extra device we attach?

Which extra work did you specifically refer to? Each attach just
implies writing the base address of the I/O page table to the IOMMU
structure corresponding to this device (either a per-device entry or a
per-device+PASID entry).

And generally device attach should not be on a hot path.

> 
>  * With kernel managed IO page tables could attaching a second device
>    (at least on some IOMMU models) require some operation which would
>    require discarding those tables?  e.g. if the second device somehow
>    forces a different IO page size

Then the attach should fail and the user should create another IOASID
for the second device.

> 
> For that reason I wonder if we want some sort of explicit enable or
> activate call.  Device attaches would only be valid before, map or
> attach pagetable calls would only be valid after.

I'm interested in seeing a real example that requires an explicit enable...

> 
> > One I/O address space could be attached to multiple devices. In this case,
> > /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> >
> > Based on the underlying IOMMU capability one device might be allowed
> > to attach to multiple I/O address spaces, with DMAs accessing them by
> > carrying different routing information. One of them is the default I/O
> > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > remaining are routed by RID + Process Address Space ID (PASID) or
> > Stream+Substream ID. For simplicity the following context uses RID and
> > PASID when talking about the routing information for I/O address spaces.
> 
> I'm not really clear on how this interacts with nested ioasids.  Would
> you generally expect the RID+PASID IOASes to be children of the base
> RID IOAS, or not?

No. With Intel SIOV both the parent and the children could be RID+PASID,
e.g. when one enables vSVA on an mdev.

> 
> If the PASID ASes are children of the RID AS, can we consider this not
> as the device explicitly attaching to multiple IOASIDs, but instead
> attaching to the parent IOASID with awareness of the child ones?
> 
> > Device attachment is initiated through passthrough framework uAPI (use
> > VFIO for simplicity in following context). VFIO is responsible for identifying
> > the routing information and registering it to the ioasid driver when calling
> > ioasid attach helper function. It could be RID if the assigned device is
> > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > user might also provide its view of virtual routing information (vPASID) in
> > the attach call, e.g. when multiple user-managed I/O address spaces are
> > attached to the vfio_device. In this case VFIO must figure out whether
> > vPASID should be directly used (for pdev) or converted to a kernel-
> > allocated one (pPASID, for mdev) for physical routing (see section 4).
> >
> > Device must be bound to an IOASID FD before attach operation can be
> > conducted. This is also through VFIO uAPI. In this proposal one device
> > should not be bound to multiple FD's. Not sure about the gain of
> > allowing it except adding unnecessary complexity. But if others have
> > different view we can further discuss.
> >
> > VFIO must ensure its device composes DMAs with the routing information
> > attached to the IOASID. For pdev it naturally happens since vPASID is
> > directly programmed to the device by guest software. For mdev this
> > implies any guest operation carrying a vPASID on this device must be
> > trapped into VFIO and then converted to pPASID before sent to the
> > device. A detail explanation about PASID virtualization policies can be
> > found in section 4.
> >
> > Modern devices may support a scalable workload submission interface
> > based on PCI DMWr capability, allowing a single work queue to access
> > multiple I/O address spaces. One example is Intel ENQCMD, having
> > PASID saved in the CPU MSR and carried in the instruction payload
> > when sent out to the device. Then a single work queue shared by
> > multiple processes can compose DMAs carrying different PASIDs.
> 
> Is the assumption here that the processes share the IOASID FD
> instance, but not memory?

I didn't get this question

> 
> > When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> > which, if targeting a mdev, must be converted to pPASID before sent
> > to the wire. Intel CPU provides a hardware PASID translation capability
> > for auto-conversion in the fast path. The user is expected to setup the
> > PASID mapping through KVM uAPI, with information about {vpasid,
> > ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> > to figure out the actual pPASID given an IOASID.
> >
> > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > It doesn't include any device routing information, which is only
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). If there is a
> > need of further relaying this fault into the guest, the user is responsible
> > of identifying the device attached to this IOASID (randomly pick one if
> > multiple attached devices) and then generates a per-device virtual I/O
> > page fault into guest. Similarly the iotlb invalidation uAPI describes the
> > granularity in the I/O address space (all, or a range), different from the
> > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> >
> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure. Some platforms implement the PASID table in the guest
> > physical space (GPA), expecting it managed by the guest. The guest
> > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > representing the per-RID vPASID space.
> 
> Do we need to consider two management modes here, much as we have for
> the pagetables themsleves: either kernel managed, in which we have
> explicit calls to bind a vPASID to a parent PASID, or user managed in
> which case we register a table in some format.

Yes, this is related to PASID virtualization in section 4. Based on
Jason's suggestion, the vPASID requirement will be reported to
userspace via the per-device reporting interface.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  6:15 ` David Gibson
  2021-06-02 17:19   ` Jason Gunthorpe
  2021-06-03  7:17   ` Tian, Kevin
@ 2021-06-03  8:12   ` Tian, Kevin
  2021-06-17  4:07     ` David Gibson
  2 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-03  8:12 UTC (permalink / raw)
  To: David Gibson
  Cc: LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, June 2, 2021 2:15 PM
>
[...]
 
> >
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   *	- VFIO type1 map/unmap;
> >   *	- pgtable/pasid_table binding
> >   *	- hardware nesting vs. software nesting;
> >   *	- ...
> >   *
> >   * Related attributes:
> >   * 	- supported page sizes, reserved IOVA ranges (DMA mapping);
> 
> Can I request we represent this in terms of permitted IOVA ranges,
> rather than reserved IOVA ranges.  This works better with the "window"
> model I have in mind for unifying the restrictions of the POWER IOMMU
> with Type1 like mapping.

Can you elaborate on how permitted ranges work better here?

> > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> 
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Yes. Following Jason's suggestion the format will be specified when
creating an IOASID, so an incompatible cmd will simply be rejected.
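
Roughly, the creation parameters could then look like this (a sketch
only; the struct and field names are invented here):

	/* hypothetical: format and protocol fixed at creation time */
	struct ioasid_alloc_data {
		__u32	argsz;
		__u32	flags;
	#define IOASID_ALLOC_KERNEL_PGTABLE	(1 << 0) /* MAP/UNMAP */
	#define IOASID_ALLOC_USER_PGTABLE	(1 << 1) /* BIND_PGTABLE */
		__u32	parent_ioasid;	/* 0 if not nested */
		__u32	pgtable_format;	/* as reported by IOASID_GET_INFO */
		__u32	addr_width;
		__u32	__reserved;
	};

MAP_DMA on a USER_PGTABLE ioasid (or BIND_PGTABLE on a KERNEL_PGTABLE
one) would then simply fail with -EINVAL.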

> > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE,
> IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> 
> I'm assuming that UNBIND would return the IOASID to a kernel-managed
> pagetable?

There will be no UNBIND call in the next version; unbind will be
handled automatically when the IOASID is destroyed.

> 
> For debugging and certain hypervisor edge cases it might be useful to
> have a call to allow userspace to lookup and specific IOVA in a guest
> managed pgtable.

Since all the mapping metadata comes from userspace, why would one
rely on the kernel to provide such a service? Or are you simply asking
for some debugfs node to dump the I/O page table for a given
IOASID?

> 
> 
> > /*
> >   * Bind an user-managed PASID table to the IOMMU
> >   *
> >   * This is required for platforms which place PASID table in the GPA space.
> >   * In this case the specified IOASID represents the per-RID PASID space.
> >   *
> >   * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> >   * special flag to indicate the difference from normal I/O address spaces.
> >   *
> >   * The format info of the PASID table is reported in IOASID_GET_INFO.
> >   *
> >   * As explained in the design section, user-managed I/O page tables must
> >   * be explicitly bound to the kernel even on these platforms. It allows
> >   * the kernel to uniformly manage I/O address spaces cross all platforms.
> >   * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> >   * to carry device routing information to indirectly mark the hidden I/O
> >   * address spaces.
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> 
> Wouldn't this be the parent ioasid, rather than one of the potentially
> many child ioasids?

There is just one child IOASID (per device) for this PASID table.

The parent ioasid in this case carries the GPA mapping.

> >
> > /*
> >   * Invalidate IOTLB for an user-managed I/O page table
> >   *
> >   * Unlike what's defined in include/uapi/linux/iommu.h, this command
> >   * doesn't allow the user to specify cache type and likely support only
> >   * two granularities (all, or a specified range) in the I/O address space.
> >   *
> >   * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> >   * cache). If the IOASID represents an I/O address space, the invalidation
> >   * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> >   * represents a vPASID space, then this command applies to the PASID
> >   * cache.
> >   *
> >   * Similarly this command doesn't provide IOMMU-like granularity
> >   * info (domain-wide, pasid-wide, range-based), since it's all about the
> >   * I/O address space itself. The ioasid driver walks the attached
> >   * routing information to match the IOMMU semantics under the
> >   * hood.
> >   *
> >   * Input parameters:
> >   *	- child_ioasid;
> 
> And couldn't this be be any ioasid, not just a child one, depending on
> whether you want PASID scope or RID scope invalidation?

Yes, any ioasid can accept an invalidation cmd. This was based on
the old assumption that bind+invalidate only applies to a child, which
will be fixed in the next version.

> > /*
> >   * Attach a vfio device to the specified IOASID
> >   *
> >   * Multiple vfio devices can be attached to the same IOASID, and vice
> >   * versa.
> >   *
> >   * User may optionally provide a "virtual PASID" to mark an I/O page
> >   * table on this vfio device. Whether the virtual PASID is physically used
> >   * or converted to another kernel-allocated PASID is a policy in vfio device
> >   * driver.
> >   *
> >   * There is no need to specify ioasid_fd in this call due to the assumption
> >   * of 1:1 connection between vfio device and the bound fd.
> >   *
> >   * Input parameter:
> >   *	- ioasid;
> >   *	- flag;
> >   *	- user_pasid (if specified);
> 
> Wouldn't the PASID be communicated by whether you give a parent or
> child ioasid, rather than needing an extra value?

No. ioasid is just the software handle. 

> > 	struct ioasid_data {
> > 		// link to ioasid_ctx->ioasid_list
> > 		struct list_head		next;
> >
> > 		// the IOASID number
> > 		u32			ioasid;
> >
> > 		// the handle to convey iommu operations
> > 		// hold the pgd (TBD until discussing iommu api)
> > 		struct iommu_domain *domain;
> >
> > 		// map metadata (vfio type1 semantics)
> > 		struct rb_node		dma_list;
> 
> Why do you need this?  Can't you just store the kernel managed
> mappings in the host IO pgtable?

A simple reason is that to implement vfio type1 semantics we need to
make sure an unmap uses the same size as the corresponding map. The
metadata allows verifying this assumption. Another reason is that when
doing software nesting, the page table linked into the iommu domain is
the shadow one. It's better to keep the original metadata so it can be
used to update the shadow when another level (parent or child) changes
its mapping.
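
For reference, a minimal sketch of the per-mapping metadata this
implies, mirroring what vfio type1 keeps today (the struct and field
names are illustrative):

	/* illustrative node in ioasid_data->dma_list, keyed by iova */
	struct ioasid_dma {
		struct rb_node		node;
		dma_addr_t		iova;
		size_t			size;
		unsigned long		vaddr;	/* backing userspace VA */
		int			prot;
	};

UNMAP looks up the node by iova and fails unless the requested size
matches dma->size exactly (type1 semantics); for software nesting the
same nodes are walked to rebuild the shadow mapping when the parent or
child changes.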

> >
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> 
> In this case, I feel like the preregistration is redundant with the
> GPA level mapping.  As long as the gIOVA mappings (which might be
> frequent) can piggyback on the accounting done for the GPA mapping we
> accomplish what we need from preregistration.

Yes, preregistration makes more sense when multiple IOASIDs are used
but not nested together.

> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boots the guest further create a GVA address spaces (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid expose ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > 	/* After boots */
> > 	/* Make GVA space nested on GPA space */
> > 	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > 				gpa_ioasid);
> 
> I'm not clear what gva_ioasid is representing.  Is it representing a
> single vPASID's address space, or a whole bunch of vPASIDs address
> spaces?

a single vPASID's address space.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  6:49               ` Tian, Kevin
@ 2021-06-03 11:47                 ` Jason Gunthorpe
  2021-06-04  2:15                   ` Tian, Kevin
  2021-06-08  0:49                 ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 11:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: David Gibson, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy

On Thu, Jun 03, 2021 at 06:49:20AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 3, 2021 1:09 PM
> [...]
> > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > cache.
> > > >
> > > > The collaposed shadow page table is really just a cache.
> > > >
> > >
> > > OK. One additional thing is that we may need a 'caching_mode"
> > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > required when changing non-present to present. For hardware
> > > nesting it's not reported as the hardware IOMMU will walk the
> > > guest page table in cases of iotlb miss. For software nesting
> > > caching_mode is reported so the user must issue invalidation
> > > upon any change in guest page table so the kernel can update
> > > the shadow page table timely.
> > 
> > For the fist cut, I'd have the API assume that invalidates are
> > *always* required.  Some bypass to avoid them in cases where they're
> > not needed can be an additional extension.
> > 
> 
> Isn't the typical TLB semantic that non-present entries are not
> cached, thus invalidation is not required when changing an entry from
> non-present to present? That's true for both the CPU TLB and the
> IOMMU TLB. In reality I feel there are more usages built on hardware
> nesting than software nesting, so making the default follow the
> hardware TLB behavior makes more sense...

From a modelling perspective it makes sense to have the most general
case be the default; if an implementation can elide certain steps then
describing those as additional behaviors on top of the universal
baseline is cleaner.

I'm surprised to hear your remarks about the not-present case though;
how does the vIOMMU emulation work if there are no hypervisor
invalidation traps for not-present/present transitions?

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  5:13       ` David Gibson
@ 2021-06-03 11:52         ` Jason Gunthorpe
  2021-06-08  0:53           ` David Gibson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 11:52 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:

> > We can still consider it a single "address space" from the IOMMU
> > perspective. What has happened is that the address table is not just a
> > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> 
> True.  This does complexify how we represent what IOVA ranges are
> valid, though.  I'll bet you most implementations don't actually
> implement a full 64-bit IOVA, which means we effectively have a large
> number of windows from (0..max IOVA) for each valid pasid.  This adds
> another reason I don't think my concept of IOVA windows is just a
> power specific thing.

Yes

Things rapidly get into weird hardware specific stuff though, the
request will be for things like:
  "ARM PASID&IO page table format from SMMU IP block vXX"

Which may have a bunch of (possibly very weird!) format specific data
to describe and/or configure it.

The uAPI needs to be suitably general here. :(
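
Probably this ends up as the usual 'type + opaque blob' pattern,
something like the sketch below (names invented, not a proposal):

	/* hypothetical page table description passed at bind/alloc time */
	struct ioasid_pgtable_format {
		__u32	argsz;
		__u32	type;		/* e.g. vtd stage-1, smmuv3 stage-1 */
		__u32	data_len;
		__u32	__reserved;
		__u64	data_uptr;	/* type-specific config/quirk struct */
	};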

> > If we are already going in the direction of having the IOASID specify
> > the page table format and other details, specifying that the page
> > tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> > step.
> 
> Well, rather I think userspace needs to request what page table format
> it wants and the kernel tells it whether it can oblige or not.

Yes, this is what I meant.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  5:45       ` David Gibson
@ 2021-06-03 12:11         ` Jason Gunthorpe
  2021-06-04  6:08           ` Tian, Kevin
  2021-06-08  6:13           ` David Gibson
  2021-06-04 10:24         ` Jean-Philippe Brucker
  1 sibling, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:11 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > 	/* Bind guest I/O page table  */
> > > > > 	bind_data = {
> > > > > 		.ioasid	= gva_ioasid;
> > > > > 		.addr	= gva_pgtable1;
> > > > > 		// and format information
> > > > > 	};
> > > > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > > 
> > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > there any reason to split these things? The only advantage to the
> > > > split is the device is known, but the device shouldn't impact
> > > > anything..
> > > 
> > > I'm pretty sure the device(s) could matter, although they probably
> > > won't usually. 
> > 
> > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > devices first. This prevents wildly incompatible devices from being
> > joined together, and allows some "get info" to report the capability
> > union of all devices if we want to do that.
> 
> Right.. but I've not been convinced that having a /dev/iommu fd
> instance be the boundary for these types of things actually makes
> sense.  For example if we were doing the preregistration thing
> (whether by child ASes or otherwise) then that still makes sense
> across wildly different devices, but we couldn't share that layer if
> we have to open different instances for each of them.

It is something that still seems up in the air.. What seems clear for
/dev/iommu is that it
 - holds a bunch of IOASID's organized into a tree
 - holds a bunch of connected devices
 - holds a pinned memory cache

One thing it must do is enforce IOMMU group security. A device cannot
be attached to an IOASID unless all devices in its IOMMU group are
part of the same /dev/iommu FD.

The big open question is what parameters govern allowing devices to
connect to the /dev/iommu:
 - all devices can connect and we model the differences inside the API
   somehow.
 - Only sufficiently "similar" devices can be connected
 - The FD's capability is the minimum of all the connected devices

There are some practical problems here: when an IOASID is created the
kernel does need to allocate a page table for it, and that has to be
in some definite format.

It may be that we had a false start in thinking the FD container
should be limited. Perhaps creating an IOASID should pass in a list of
the "device labels" that the IOASID will be used with, and that can
guide the kernel on what to do?
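
As a sketch of that idea (purely illustrative), the IOASID creation
ioctl could take something like:

	/* hypothetical: hint which devices will use this IOASID */
	struct ioasid_alloc_hint {
		__u32	argsz;
		__u32	ndevs;
		__u64	dev_labels_uptr; /* array of __u32 labels of devices
					  * already bound to this FD */
	};

The kernel would intersect the IOMMU capabilities behind those labels
and fail the allocation if no common page table format exists.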

> Right, but at this stage I'm just not seeing a really clear (across
> platforms and device typpes) boundary for what things have to be per
> IOASID container and what have to be per IOASID, so I'm just not sure
> the /dev/iommu instance grouping makes any sense.

I would push as much stuff as possible to be per-IOASID..
 
> > I don't know if that small advantage is worth the extra complexity
> > though.
> > 
> > > But it would certainly be possible for a system to have two
> > > different host bridges with two different IOMMUs with different
> > > pagetable formats.  Until you know which devices (and therefore
> > > which host bridge) you're talking about, you don't know what formats
> > > of pagetable to accept.  And if you have devices from *both* bridges
> > > you can't bind a page table at all - you could theoretically support
> > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > in both formats, but it would be pretty reasonable not to support
> > > that.
> > 
> > The basic process for a user space owned pgtable mode would be:
> > 
> >  1) qemu has to figure out what format of pgtable to use
> > 
> >     Presumably it uses query functions using the device label.
> 
> No... in the qemu case it would always select the page table format
> that it needs to present to the guest.  That's part of the
> guest-visible platform that's selected by qemu's configuration.

I should have said "vfio user" here because apps like DPDK might use
this path
 
> >  4) For the next device qemu would have to figure out if it can re-use
> >     an existing IOASID based on the required proeprties.
> 
> Nope.  Again, what devices share an IO address space is a guest
> visible part of the platform.  If the host kernel can't supply that,
> then qemu must not start (or fail the hotplug if the new device is
> being hotplugged).

qemu can always emulate. If the config requires two devices that
cannot share an IOASID because the local platform is wonky then qemu
needs to shadow and duplicate the IO page table from the guest into two
IOASID objects to make it work. This is a SW emulation option.

> For this reason, amongst some others, I think when selecting a kernel
> managed pagetable we need to also have userspace explicitly request
> which IOVA ranges are mappable, and what (minimum) page size it
> needs.

It does make sense

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  5:23           ` David Gibson
@ 2021-06-03 12:28             ` Jason Gunthorpe
  2021-06-08  6:04               ` David Gibson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:28 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 03:23:17PM +1000, David Gibson wrote:
> On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> > 
> > > I don't think presence or absence of a group fd makes a lot of
> > > difference to this design.  Having a group fd just means we attach
> > > groups to the ioasid instead of individual devices, and we no longer
> > > need the bookkeeping of "partial" devices.
> > 
> > Oh, I think we really don't want to attach the group to an ioasid, or
> > at least not as a first-class idea.
> > 
> > The fundamental problem that got us here is we now live in a world
> > where there are many ways to attach a device to an IOASID:
> 
> I'm not seeing that that's necessarily a problem.
> 
> >  - A RID binding
> >  - A RID,PASID binding
> >  - A RID,PASID binding for ENQCMD
> 
> I have to admit I haven't fully grasped the differences between these
> modes.  I'm hoping we can consolidate at least some of them into the
> same sort of binding onto different IOASIDs (which may be linked in
> parent/child relationships).

What I would like is that the /dev/iommu side managing the IOASID
doesn't really care much, but the device driver has to tell
drivers/iommu what it is going to do when it attaches.

It makes sense: in PCI terms, only the driver knows what TLPs the
device will generate. The IOMMU needs to know what TLPs it will
receive in order to configure itself properly.

PASID or not is a major device-specific variation, as is ENQCMD/etc.

Having the device be explicit when it tells the IOMMU what it is going
to be sending is a major plus to me. I actually don't want to see this
part of the interface be made less strong.

> > The selection of which mode to use is based on the specific
> > driver/device operation. Ie the thing that implements the 'struct
> > vfio_device' is the thing that has to select the binding mode.
> 
> I thought userspace selected the binding mode - although not all modes
> will be possible for all devices.

/dev/iommu is concerned with setting up the IOAS and filling the IO
page tables with information

The driver behind "struct vfio_device" is responsible for "routing" its
HW into that IOAS.

They are two halves of the problem: one is only the io page table, and
the other is the connection of a PCI TLP to a specific io page table.

Only the driver knows what format of TLPs the device will generate so
only the driver can specify the "route"
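
A minimal sketch of what "the driver specifies the route" could mean
at the driver-facing API (names invented here, this is not a proposal):

	/* illustrative: the vfio_device driver declares what it sends */
	enum ioasid_attach_mode {
		IOASID_ATTACH_RID,		/* TLPs tagged with RID only */
		IOASID_ATTACH_RID_PASID,	/* TLPs tagged RID + PASID */
		IOASID_ATTACH_RID_PASID_ENQCMD,	/* PASID from the CPU MSR */
	};

	int ioasid_device_attach(struct ioasid_dev *idev, u32 ioasid,
				 enum ioasid_attach_mode mode, u32 pasid);

drivers/iommu then knows whether to install a RID entry or a RID+PASID
entry, and whether ENQCMD PASID translation has to be set up.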
 
> > eg if two PCI devices are in a group then it is perfectly fine that
> > one device uses RID binding and the other device uses RID,PASID
> > binding.
> 
> Uhhhh... I don't see how that can be.  They could well be in the same
> group because their RIDs cannot be distinguished from each other.

Inability to match the RID is rare; certainly I would expect that any
IOMMU HW that can do PCIe PASID matching can also do RID matching. With
such HW the above is perfectly fine - the group may not be secure
between members (eg !ACS), but the TLPs still carry valid RIDs and
PASIDs and the IOMMU can still discriminate.

I think you are talking about really old IOMMU's that could only
isolate based on ingress port or something.. I suppose modern PCIe has
some cases like this in the NTB stuff too.

Oh, I hadn't spent time thinking about any of those.. It is messy but
it can still be forced to work, I guess. A device centric model means
all the devices using the same routing ID have to be connected to the
same IOASID by userspace. So some of the connections will be NOPs.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  2:50                           ` Alex Williamson
  2021-06-03  3:22                             ` Tian, Kevin
@ 2021-06-03 12:34                             ` Jason Gunthorpe
  2021-06-03 20:01                               ` Alex Williamson
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Wed, Jun 02, 2021 at 08:50:54PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > 
> > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?    
> > 
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> > 
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> > 
> > Didn't closely read the kvm code :\
> > 
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> > 
> > So really the test is just 'were we able to block it' ?
> 
> Yup.  Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop?  This seems largely a
> legacy thing.

I've had some no-snoop discussions recently.. The issue didn't vanish;
it is still expensive going through all that cache hardware.

> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> > 
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> >    domains.
> > 
> >    This doesn't actually matter. If you mix them together then kvm
> >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> >    anywhere in this VM.
> > 
> >    This if two IOMMU's are joined together into a single /dev/ioasid
> >    then we can just make them both pretend to be
> >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> 
> Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available.  That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.

I think we are saying the same thing..
 
> > 2) How to fit this part of kvm in some new /dev/ioasid world
> > 
> >    What we want to do here is iterate over every ioasid associated
> >    with the group fd that is passed into kvm.
> 
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.

Fingers crossed on RFCv2.. Here I mean the IOASID object inside the
/dev/iommu FD. The vfio_device would have some kref handle to the
in-kernel representation of it. So we can interact with it..

> >    Or perhaps more directly: an op attaching the vfio_device to the
> >    kvm and having some simple helper 
> >          '(un)register ioasid with kvm (kvm, ioasid)'
> >    that the vfio_device driver can call that just sorts this out.
>
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties.

That is another quite good option: just forget about trying to be
highly specific, feed in the /dev/ioasid FD, and have kvm ask "does
anything in here not enforce snoop?"

With something appropriate to track/block changing that answer.
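
Something like this, as a sketch (none of these helpers exist today,
the names are made up):

	/* illustrative KVM-facing helpers */
	bool ioasidfd_all_enforce_snoop(struct file *ioasid_filp);

	/* re-evaluate wbinvd emulation when the device/IOASID set changes */
	int ioasidfd_register_coherency_notifier(struct file *ioasid_filp,
						 struct notifier_block *nb);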

It doesn't solve the problem to connect kvm to AP and kvmgt though

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  3:22                             ` Tian, Kevin
  2021-06-03  4:14                               ` Alex Williamson
@ 2021-06-03 12:40                               ` Jason Gunthorpe
  2021-06-03 20:41                                 ` Alex Williamson
  2021-06-04  7:33                                 ` Tian, Kevin
  1 sibling, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, Jun 03, 2021 at 03:22:27AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, June 3, 2021 10:51 AM
> > 
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >
> > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?
> > 
> > Yup.  Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop?  This seems largely a
> > legacy thing.
> 
> Yes, there is indeed performance benefit for device to use no-snoop,
> e.g. 8K display and some imaging processing path, etc. The problem is
> that the IOMMU for such devices is typically a different one from the
> default IOMMU for most devices. This special IOMMU may not have
> the ability of enforcing snoop on no-snoop PCI traffic then this fact
> must be understood by KVM to do proper mtrr/pat/wbinvd virtualization 
> for such devices to work correctly.

Or stated another way:

We in Linux don't have a way to control if the VFIO IO page table will
be snoop or no snoop from userspace so Intel has forced the platform's
IOMMU path for the integrated GPU to be unable to enforce snoop, thus
"solving" the problem.

I don't think that is sustainable in the overall ecosystem though.

'qemu --allow-no-snoop' makes more sense to me

> When discussing I/O page fault support in another thread, the consensus
> is that an device handle will be registered (by user) or allocated (return
> to user) in /dev/ioasid when binding the device to ioasid fd. From this 
> angle we can register {ioasid_fd, device_handle} to KVM and then call 
> something like ioasidfd_device_is_coherent() to get the property. 
> Anyway the coherency is a per-device property which is not changed 
> by how many I/O page tables are attached to it.

It is not device specific; it is driver specific.

As I said before, the question is if the IOASID itself can enforce
snoop, or not. AND if the device will issue no-snoop or not.

Devices that are hard wired to never issue no-snoop are safe even with
an IOASID that cannot enforce snoop. AFAIK really only GPUs use this
feature. Eg I would be comfortable to say mlx5 never uses the no-snoop
TLP flag.

Only the vfio_driver could know this.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  6:26     ` David Gibson
@ 2021-06-03 12:46       ` Jason Gunthorpe
  2021-06-04  6:27         ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:46 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 04:26:08PM +1000, David Gibson wrote:

> > There are global properties in the /dev/iommu FD, like what devices
> > are part of it, that are important for group security operations. This
> > becomes confused if it is split to many FDs.
> 
> I'm still not seeing those.  I'm really not seeing any well-defined
> meaning to devices being attached to the fd, but not to a particular
> IOAS.

Kevin, can you add a section to the RFC on how group security would
have to work? This is the idea that you can't attach a device to an
IOASID unless all devices in the IOMMU group are joined to the
/dev/iommu FD.

The basic statement is that userspace must present the entire group
membership to /dev/iommu to prove that it has the security right to
manipulate their DMA translation.

It is the device centric analog to what the group FD is doing for
security.
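
Roughly, the attach path would do something like the sketch below.
iommu_group_for_each_dev() is the existing in-tree iterator; the
ioasid_ctx_has_device() helper and the error choice are made up for
illustration:

	/* illustrative: refuse attach unless the whole group is in FD */
	static int dev_in_ioasid_ctx(struct device *dev, void *data)
	{
		struct ioasid_ctx *ctx = data;

		return ioasid_ctx_has_device(ctx, dev) ? 0 : -EPERM;
	}

	static int ioasid_check_group(struct ioasid_ctx *ctx,
				      struct device *dev)
	{
		struct iommu_group *group = iommu_group_get(dev);
		int ret;

		if (!group)
			return -ENODEV;
		ret = iommu_group_for_each_dev(group, ctx,
					       dev_in_ioasid_ctx);
		iommu_group_put(group);
		return ret;
	}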

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  7:17   ` Tian, Kevin
@ 2021-06-03 12:49     ` Jason Gunthorpe
  2021-06-08  5:49     ` David Gibson
  1 sibling, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:49 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: David Gibson, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 07:17:23AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Wednesday, June 2, 2021 2:15 PM
> > 
> [...] 
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> > 
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extent to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implemenenting it any time soon).
> 
> Yes, it's not very clear, and stated more in a PCI context to simplify
> the description. A "physical" one here means a PCI endpoint function
> which has a unique RID. It's more to differentiate it from the later
> mdev/subdevice which uses both RID+PASID. Naming is always a hard
> exercise for me... Possibly I'll just use device vs. subdevice in
> future versions.

Using PCI words:

A "physical" device is RID matching.

A "subdevice" is (RID, PASID) matching.

A "SW mdev" is performing DMA isolation in a device specific way - all
DMA's from the device are routed to the hypervisor's IOMMU page
tables.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  6:50       ` Lu Baolu
@ 2021-06-03 12:56         ` Jason Gunthorpe
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 12:56 UTC (permalink / raw)
  To: Lu Baolu
  Cc: David Gibson, Tian, Kevin, LKML, Joerg Roedel, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 02:50:11PM +0800, Lu Baolu wrote:
> Hi David,
> 
> On 6/3/21 1:54 PM, David Gibson wrote:
> > On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> > > Hi Jason,
> > > 
> > > On 2021/5/29 7:36, Jason Gunthorpe wrote:
> > > > > /*
> > > > >     * Bind an user-managed I/O page table with the IOMMU
> > > > >     *
> > > > >     * Because user page table is untrusted, IOASID nesting must be enabled
> > > > >     * for this ioasid so the kernel can enforce its DMA isolation policy
> > > > >     * through the parent ioasid.
> > > > >     *
> > > > >     * Pgtable binding protocol is different from DMA mapping. The latter
> > > > >     * has the I/O page table constructed by the kernel and updated
> > > > >     * according to user MAP/UNMAP commands. With pgtable binding the
> > > > >     * whole page table is created and updated by userspace, thus different
> > > > >     * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> > > > >     *
> > > > >     * Because the page table is directly walked by the IOMMU, the user
> > > > >     * must  use a format compatible to the underlying hardware. It can
> > > > >     * check the format information through IOASID_GET_INFO.
> > > > >     *
> > > > >     * The page table is bound to the IOMMU according to the routing
> > > > >     * information of each attached device under the specified IOASID. The
> > > > >     * routing information (RID and optional PASID) is registered when a
> > > > >     * device is attached to this IOASID through VFIO uAPI.
> > > > >     *
> > > > >     * Input parameters:
> > > > >     *	- child_ioasid;
> > > > >     *	- address of the user page table;
> > > > >     *	- formats (vendor, address_width, etc.);
> > > > >     *
> > > > >     * Return: 0 on success, -errno on failure.
> > > > >     */
> > > > > #define IOASID_BIND_PGTABLE		_IO(IOASID_TYPE, IOASID_BASE + 9)
> > > > > #define IOASID_UNBIND_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 10)
> > > > Also feels backwards, why wouldn't we specify this, and the required
> > > > page table format, during alloc time?
> > > > 
> > > Thinking of the required page table format, perhaps we should shed more
> > > light on the page table of an IOASID. So far, an IOASID might represent
> > > one of the following page tables (might be more):
> > > 
> > >   1) an IOMMU format page table (a.k.a. iommu_domain)
> > >   2) a user application CPU page table (SVA for example)
> > >   3) a KVM EPT (future option)
> > >   4) a VM guest managed page table (nesting mode)
> > > 
> > > This version only covers 1) and 4). Do you think we need to support 2),
> > Isn't (2) the equivalent of using the using the host-managed pagetable
> > then doing a giant MAP of all your user address space into it?  But
> > maybe we should identify that case explicitly in case the host can
> > optimize it.
> 
> Conceptually, yes. Current SVA implementation just reuses the
> application's cpu page table w/o map/unmap operations.

The key distinction is faulting, and this goes back to the importance
of having the device tell drivers/iommu what TLPs it is generating.

A #1 table with a map of 'all user space memory' does not have IO DMA
faults. The pages should be pinned and this object should be
compatible with any DMA user.

A #2/#3 table allows page faulting, and it can only be used with a
device that supports the page faulting protocol. For instance a PCI
device needs to say it is running in ATS mode and supports PRI. This
is where you might fit in CAPI generically.

As the other case in my other email, the kind of TLPs the device
generates is only known by the driver when it connects to the IOASID
and must be communicated to the IOMMU so it knows how to set things
up. ATS/PRI w/ faulting is a very different setup than simple RID
matching.
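
A minimal sketch of what communicating this at attach time could look
like; every name and flag below is hypothetical, only meant to make the
idea concrete:

	/* what kind of TLPs the device will generate against this IOASID */
	#define IOASID_ATTACH_F_PASID	(1 << 0)	/* TLPs carry a PASID */
	#define IOASID_ATTACH_F_ATS_PRI	(1 << 1)	/* ATS + PRI, IO page faulting OK */

	struct ioasid_attach_info {
		u32	ioasid;
		u32	flags;	/* IOASID_ATTACH_F_*, 0 = plain RID matching */
		u32	pasid;	/* valid only with IOASID_ATTACH_F_PASID */
	};

The IOMMU layer would then refuse IOASID_ATTACH_F_ATS_PRI unless the
IOASID being attached actually supports IO page faulting (the #2/#3
case above).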

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  6:39   ` Tian, Kevin
@ 2021-06-03 13:05     ` Jason Gunthorpe
  2021-06-04  6:37       ` Tian, Kevin
  2021-06-15  8:59       ` Tian, Kevin
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 13:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > >
> > > 	struct attach_info {
> > > 		u32	ioasid;
> > > 		// If valid, the PASID to be used physically
> > > 		u32	pasid;
> > > 	};
> > > 	int ioasid_device_attach(struct ioasid_dev *dev,
> > > 		struct attach_info info);
> > > 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > 
> > Honestly, I still prefer this to be highly explicit as this is where
> > all device driver authors get involved:
> > 
> > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> > u32 ioasid);
> > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid,
> > struct ioasid_dev *dev, u32 ioasid);
> 
> Then better naming it as pci_device_attach_ioasid since the 1st parameter
> is struct pci_device?

No, the leading tag indicates the API's primary subsystem, in this case
it is iommu (and if you prefer, list the iommu-related arguments first).

> By keeping physical_pasid as a pointer, you want to remove the last helper
> function (ioasid_get_global_pasid) so the global pasid is returned along
> with the attach function?

It is just a thought.. It allows the caller to both specify a fixed
PASID and request an allocation.
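
A sketch of that in/out convention, reusing the helper names quoted
above (everything here is still hypothetical):

	/*
	 * If *pasid == IOASID_PASID_ANY the kernel allocates a pPASID,
	 * otherwise the given value is used. On success *pasid holds the
	 * pPASID actually programmed.
	 */
	int ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *pasid,
					   struct ioasid_dev *dev, u32 ioasid);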

I still don't have a clear idea how all this PASID complexity should
work, sorry.

> > > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > > supported. There are three possible scenarios:
> > >
> > > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > > policies.)
> > 
> > This has become unclear. I think this should start by identifying the
> > 6 main type of devices and how they can use pPASID/vPASID:
> > 
> > 0) Device is a RID and cannot issue PASID
> > 1) Device is a mdev and cannot issue PASID
> > 2) Device is a mdev and programs a single fixed PASID during bind,
> >    does not accept PASID from the guest
> 
> There is no vPASID per se in the above 3 types. So this section only
> focuses on the latter 3 types. But I can include them in the next version
> if it makes things clearer.

I think it helps

> > 
> > 3) Device accepts any PASIDs from the guest. No
> >    vPASID/pPASID translation is possible. (classic vfio_pci)
> > 4) Device accepts any PASID from the guest and has an
> >    internal vPASID/pPASID translation (enhanced vfio_pci)
> 
> what is enhanced vfio_pci? In my writing this is for mdev
> which doesn't support ENQCMD

This is a vfio_pci that mediates some element of the device interface
to communicate the vPASID/pPASID table to the device, using Max's
series for vfio_pci drivers to inject itself into VFIO.

For instance a device might send a message through the PF that the VF
has a certain vPASID/pPASID translation table. This would be useful
for devices that cannot use ENQCMD but still want to support migration
and thus need vPASID.

> for 0-2 the device will report no PASID support. Although this may duplicate
> other information (e.g. the PCI PASID cap), this provides a vendor-agnostic
> way of reporting details around IOASID.

We have to consider mdevs too here, so PCI caps are not general enough
 
> for 3-5 the device will report PASID support. In these cases the user is
> expected to always provide a vPASID. 
> 
> for 5 in addition the device will report a requirement on CPU PASID
> translation. For such a device the user should talk to KVM to set up the PASID
> mapping. This way the user doesn't need to know whether a device is
> pdev or mdev; it just follows what the device capability reports.

Something like that. Needs careful documentation
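
As a strawman, the reporting could be a couple of flags in the per-device
info (all names below are made up, only to make it concrete):

	/* returned by something like IOASID_GET_DEV_INFO */
	#define IOASID_DEV_F_PASID		(1 << 0)  /* types 3-5: user supplies vPASID */
	#define IOASID_DEV_F_CPU_PASID_XLATE	(1 << 1)  /* type 5: KVM PASID mapping needed */

	struct ioasid_dev_info {
		u32	flags;
		u32	pasid_bits;	/* width of the PASID space, if any */
	};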

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  2:52                         ` Jason Wang
@ 2021-06-03 13:09                           ` Jason Gunthorpe
  2021-06-04  1:11                             ` Jason Wang
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 13:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse

On Thu, Jun 03, 2021 at 10:52:51AM +0800, Jason Wang wrote:

> Basically, we don't want to bother with a pseudo KVM device like what VFIO
> did. So for simplicity, we rule out the IOMMU that can't enforce coherency
> in vhost-vDPA if the parent purely depends on the platform IOMMU:

VDPA HW cannot issue no-snoop TLPs in the first place.

virtio does not define a protocol to discover such a functionality,
nor do any virtio drivers implement the required platform specific
cache flushing to make no-snoop TLPs work.

It is fundamentally part of the virtio HW PCI API that a device vendor
cannot alter.

Basically since we already know that the virtio kernel drivers do not
call the cache flush instruction we don't need the weird KVM logic to
turn it on at all.

Enforcing no-snoop at the IOMMU here is redundant/confusing.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  4:50           ` Shenming Lu
@ 2021-06-03 18:19             ` Jacob Pan
  2021-06-04  1:30               ` Jason Wang
  2021-06-04  2:03               ` Shenming Lu
  0 siblings, 2 replies; 258+ messages in thread
From: Jacob Pan @ 2021-06-03 18:19 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Jason Gunthorpe, Lu Baolu, Tian, Kevin, LKML, Joerg Roedel,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang,
	jacob.jun.pan

Hi Shenming,

On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <lushenming@huawei.com>
wrote:

> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >   
> >> The drivers register per page table fault handlers to /dev/ioasid which
> >> will then register itself to iommu core to listen and route the per-
> >> device I/O page faults.   
> > 
> > I'm still confused why drivers need fault handlers at all?  
> 
> Essentially it is the userspace that needs the fault handlers,
> one case is to deliver the faults to the vIOMMU, and another
> case is to enable IOPF on the GPA address space for on-demand
> paging, it seems that both could be specified in/through the
> IOASID_ALLOC ioctl?
> 
I would think IOASID_BIND_PGTABLE is where fault handler should be
registered. There wouldn't be any IO page fault without the binding anyway.

I also don't understand why device drivers should register the fault
handler; the fault is detected by the pIOMMU and injected into the vIOMMU. So
I think it should be the IOASID itself that registers the handler.
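
Something along these lines, i.e. the handler hangs off the IOASID and is
installed when the page table is bound (hypothetical signature, only to
illustrate the idea):

	/* registered as part of IOASID_BIND_PGTABLE for this ioasid */
	int ioasid_register_fault_handler(struct ioasid_ctx *ictx, u32 ioasid,
					  int (*handler)(u32 ioasid, u64 addr,
							 u32 reason, void *data),
					  void *data);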

> Thanks,
> Shenming
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 12:34                             ` Jason Gunthorpe
@ 2021-06-03 20:01                               ` Alex Williamson
  2021-06-03 20:10                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-03 20:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, 3 Jun 2021 09:34:01 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 02, 2021 at 08:50:54PM -0600, Alex Williamson wrote:
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >   
> > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?      
> > > 
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > > 
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > > 
> > > Didn't closely read the kvm code :\
> > > 
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > > 
> > > So really the test is just 'were we able to block it' ?  
> > 
> > Yup.  Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop?  This seems largely a
> > legacy thing.  
> 
> I've recently had some no-snoopy discussions.. The issue didn't
> vanish, it is still expensive going through all that cache hardware.
> 
> > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > had..
> > > 
> > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > >    domains.
> > > 
> > >    This doesn't actually matter. If you mix them together then kvm
> > >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > >    anywhere in this VM.
> > > 
> > >    Thus if two IOMMUs are joined together into a single /dev/ioasid
> > >    then we can just make them both pretend to be
> > >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.  
> > 
> > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > available based on the per domain support available.  That gives us the
> > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > because they used to have a device attached where the domain required
> > it and we can't atomically remap with new flags to perform the same as
> > a VM that never had that device attached in the first place.  
> 
> I think we are saying the same thing..

Hrm?  I think I'm saying the opposite of your "both not set
IOMMU_CACHE".  IOMMU_CACHE is the mapping flag that enables
DMA_PTE_SNP.  Maybe you're using IOMMU_CACHE as the state reported to
KVM?

> > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > > 
> > >    What we want to do here is iterate over every ioasid associated
> > >    with the group fd that is passed into kvm.  
> > 
> > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > then attaching a device to an allocated ioasid (non-fd)... I assume
> > you're talking about the latter ioasid.  
> 
> Fingers crossed on RFCv2.. Here I mean the IOASID object inside the
> /dev/iommu FD. The vfio_device would have some kref handle to the
> in-kernel representation of it. So we can interact with it..
> 
> > >    Or perhaps more directly: an op attaching the vfio_device to the
> > >    kvm and having some simple helper 
> > >          '(un)register ioasid with kvm (kvm, ioasid)'
> > >    that the vfio_device driver can call that just sorts this out.  
> >
> > We could almost eliminate the device notion altogether here, use an
> > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > change to the composition of the device set for the ioasid, which is
> > why we currently do it on addition or removal of a group, where the
> > group has a consistent set of IOMMU properties.  
> 
> That is another quite good option, just forget about trying to be
> highly specific and feed in the /dev/ioasid FD and have kvm ask "does
> anything in here not enforce snoop?"
> 
> With something appropriate to track/block changing that answer.
> 
> It doesn't solve the problem to connect kvm to AP and kvmgt though

It does not, we'll probably need a vfio ioctl to gratuitously announce
the KVM fd to each device.  I think some devices might currently fail
their open callback if that linkage isn't already available though, so
it's not clear when that should happen, ie. it can't currently be a
VFIO_DEVICE ioctl as getting the device fd requires an open, but this
proposal requires some availability of the vfio device fd without any
setup, so presumably that won't yet call the driver open callback.
Maybe that's part of the attach phase now... I'm not sure, it's not
clear when the vfio device uAPI starts being available in the process
of setting up the ioasid.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 20:01                               ` Alex Williamson
@ 2021-06-03 20:10                                 ` Jason Gunthorpe
  2021-06-03 21:44                                   ` Alex Williamson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-03 20:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, Jun 03, 2021 at 02:01:46PM -0600, Alex Williamson wrote:

> > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > >    domains.
> > > > 
> > > >    This doesn't actually matter. If you mix them together then kvm
> > > >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > >    anywhere in this VM.
> > > > 
> > > >    Thus if two IOMMUs are joined together into a single /dev/ioasid
> > > >    then we can just make them both pretend to be
> > > >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.  
> > > 
> > > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > available based on the per domain support available.  That gives us the
> > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > because they used to have a device attached where the domain required
> > > it and we can't atomically remap with new flags to perform the same as
> > > a VM that never had that device attached in the first place.  
> > 
> > I think we are saying the same thing..
> 
> Hrm?  I think I'm saying the opposite of your "both not set
> IOMMU_CACHE".  IOMMU_CACHE is the mapping flag that enables
> DMA_PTE_SNP.  Maybe you're using IOMMU_CACHE as the state reported to
> KVM?

I'm saying if we enable wbinvd in the guest then no IOASIDs used by
that guest need to set DMA_PTE_SNP. If we disable wbinvd in the guest
then all IOASIDs must enforce DMA_PTE_SNP (or we otherwise guarantee
no-snoop is not possible).

This is not what VFIO does today, but it is a reasonable choice.

Based on that observation we can say as soon as the user wants to use
an IOMMU that does not support DMA_PTE_SNP in the guest we can still
share the IO page table with IOMMUs that do support DMA_PTE_SNP.
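
In pseudo-code the rule reads roughly like this (just restating the
above; nothing here is an existing interface):

	if (guest_wbinvd_enabled) {
		/* no-snoop is already handled by the guest; no IOASID needs
		 * DMA_PTE_SNP and the IO page table can be shared freely */
	} else {
		/* every IOASID must set DMA_PTE_SNP, or the IOMMU/device must
		 * otherwise guarantee no-snoop TLPs cannot happen */
	}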

> > It doesn't solve the problem to connect kvm to AP and kvmgt though
> 
> It does not, we'll probably need a vfio ioctl to gratuitously announce
> the KVM fd to each device.  I think some devices might currently fail
> their open callback if that linkage isn't already available though, so
> it's not clear when that should happen, ie. it can't currently be a
> VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> proposal requires some availability of the vfio device fd without any
> setup, so presumably that won't yet call the driver open callback.
> Maybe that's part of the attach phase now... I'm not sure, it's not
> clear when the vfio device uAPI starts being available in the process
> of setting up the ioasid.  Thanks,

At a certain point we maybe just have to stick to backward compat, I
think. Though it is useful to think about green field alternates to
try to guide the backward compat design..

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 12:40                               ` Jason Gunthorpe
@ 2021-06-03 20:41                                 ` Alex Williamson
  2021-06-04  9:19                                   ` Tian, Kevin
  2021-06-04 12:13                                   ` Jason Gunthorpe
  2021-06-04  7:33                                 ` Tian, Kevin
  1 sibling, 2 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-03 20:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, 3 Jun 2021 09:40:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jun 03, 2021 at 03:22:27AM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, June 3, 2021 10:51 AM
> > > 
> > > On Wed, 2 Jun 2021 19:45:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > > >  
> > > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > > from the guest page table... what page table?  
> > > >
> > > > I see my confusion now, the phrasing in your earlier remark led me
> > > > think this was about allowing the no-snoop performance enhancement in
> > > > some restricted way.
> > > >
> > > > It is really about blocking no-snoop 100% of the time and then
> > > > disabling the dangerous wbinvd when the block is successful.
> > > >
> > > > Didn't closely read the kvm code :\
> > > >
> > > > If it was about allowing the optimization then I'd expect the guest to
> > > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > > hypervisor and plumb the whole thing through. Hence my remark about
> > > > the guest page tables..
> > > >
> > > > So really the test is just 'were we able to block it' ?  
> > > 
> > > Yup.  Do we really still consider that there's some performance benefit
> > > to be had by enabling a device to use no-snoop?  This seems largely a
> > > legacy thing.  
> > 
> > Yes, there is indeed a performance benefit for a device to use no-snoop,
> > e.g. 8K display and some image processing paths, etc. The problem is
> > that the IOMMU for such devices is typically a different one from the
> > default IOMMU for most devices. This special IOMMU may not have
> > the ability of enforcing snoop on no-snoop PCI traffic then this fact
> > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization 
> > for such devices to work correctly.  
> 
> Or stated another way:
> 
> We in Linux don't have a way to control if the VFIO IO page table will
> be snoop or no snoop from userspace so Intel has forced the platform's
> IOMMU path for the integrated GPU to be unable to enforce snoop, thus
> "solving" the problem.

That's giving vfio a lot of credit for influencing VT-d design.

> I don't think that is sustainable in the overall ecosystem though.

Our current behavior is a reasonable default IMO, but I agree more
control will probably benefit us in the long run.

> 'qemu --allow-no-snoop' makes more sense to me

I'd be tempted to attach it to the -device vfio-pci option, it's
specific drivers for specific devices that are going to want this and
those devices may not be permanently attached to the VM.  But I see in
the other thread you're trying to optimize IOMMU page table sharing.

There's a usability question in either case though and I'm not sure how
to get around it other than QEMU or the kernel knowing a list of
devices (explicit IDs or vendor+class) to select per device defaults.

> > When discussing I/O page fault support in another thread, the consensus
> > is that an device handle will be registered (by user) or allocated (return
> > to user) in /dev/ioasid when binding the device to ioasid fd. From this 
> > angle we can register {ioasid_fd, device_handle} to KVM and then call 
> > something like ioasidfd_device_is_coherent() to get the property. 
> > Anyway the coherency is a per-device property which is not changed 
> > by how many I/O page tables are attached to it.  
> 
> It is not device specific, it is driver specific
> 
> As I said before, the question is if the IOASID itself can enforce
> snoop, or not. AND if the device will issue no-snoop or not.
> 
> Devices that are hard wired to never issue no-snoop are safe even with
> an IOASID that cannot enforce snoop. AFAIK really only GPUs use this
> feature. Eg I would be comfortable to say mlx5 never uses the no-snoop
> TLP flag.
> 
> Only the vfio_driver could know this.

Could you clarify "vfio_driver"?  The existing vfio-pci driver can't
know this, beyond perhaps probing if the Enable No-snoop bit is
hardwired to zero.  It's the driver running on top of vfio that
ultimately controls whether a capable device actually issues no-snoop
TLPs, but that can't be known to us.  A vendor variant of vfio-pci
might certainly know more about how its device is used by those
userspace/VM drivers.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-01 17:30 ` Parav Pandit
@ 2021-06-03 20:58   ` Jacob Pan
  2021-06-08  6:30     ` Parav Pandit
  0 siblings, 1 reply; 258+ messages in thread
From: Jacob Pan @ 2021-06-03 20:58 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tian, Kevin, LKML, Joerg Roedel, Jason Gunthorpe, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, jacob.jun.pan

Hi Parav,

On Tue, 1 Jun 2021 17:30:51 +0000, Parav Pandit <parav@nvidia.com> wrote:

> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Thursday, May 27, 2021 1:28 PM  
> 
> > 5.6. I/O page fault
> > +++++++++++++++
> > 
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU
> > driver to guest IOMMU driver and backwards).
> > 
> > -   Host IOMMU driver receives a page request with raw fault_data {rid,
> >     pasid, addr};
> > 
> > -   Host IOMMU driver identifies the faulting I/O page table according
> > to information registered by IOASID fault handler;
> > 
> > -   IOASID fault handler is called with raw fault_data (rid, pasid,
> > addr), which is saved in ioasid_data->fault_data (used for response);
> > 
> > -   IOASID fault handler generates a user fault_data (ioasid, addr),
> > links it to the shared ring buffer and triggers eventfd to userspace;
> > 
> > -   Upon received event, Qemu needs to find the virtual routing
> > information (v_rid + v_pasid) of the device attached to the faulting
> > ioasid. If there are multiple, pick a random one. This should be fine
> > since the purpose is to fix the I/O page table on the guest;
> > 
> > -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
> >     carrying the virtual fault data (v_rid, v_pasid, addr);
> >   
> Why does it have to be through vIOMMU?
I think this flow is for a fully emulated IOMMU, where the same IOMMU and device
drivers run in the host and guest. The page request interrupt is reported by
the IOMMU, and is thus reported to the vIOMMU in the guest.

> For a VFIO PCI device, have you considered to reuse the same PRI
> interface to inject page fault in the guest? This eliminates any new
> v_rid. It will also route the page fault request and response through the
> right vfio device.
> 
I am curious how PCI PRI could be used to inject faults. Are you talking
about the PCI config PRI extended capability structure? The control is very
limited, only enable and reset. Can you explain how a page fault would be
handled through the generic PCI cap?
Some devices may have a device-specific way to handle page faults, but I
guess this is not the PCI PRI method you are referring to?

> > -   Guest IOMMU driver fixes up the fault, updates the I/O page table,
> > and then sends a page response with virtual completion data (v_rid,
> > v_pasid, response_code) to vIOMMU;
> >   
> What about fixing up the fault for mmu page table as well in guest?
> Or you meant both when above you said "updates the I/O page table"?
> 
> It is unclear to me that if there is single nested page table maintained
> or two (one for cr3 references and other for iommu). Can you please
> clarify?
> 
I think it is just one; at least for VT-d, the guest cr3 (a GPA) is stored
in the host iommu. The guest iommu driver calls handle_mm_fault to fix the mmu
page tables, which are shared with the iommu.

> > -   Qemu finds the pending fault event, converts virtual completion data
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> >     complete the pending fault;
> >   
> For VFIO PCI device a virtual PRI request response interface is done, it
> can be generic interface among multiple vIOMMUs.
> 
Same question as above; not sure how this works in terms of interrupts and
response queuing, etc.

> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr}
> > saved in ioasid_data->fault_data, and then calls iommu api to complete
> > it with {rid, pasid, response_code};
> >  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 20:10                                 ` Jason Gunthorpe
@ 2021-06-03 21:44                                   ` Alex Williamson
  2021-06-04  8:38                                     ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-03 21:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, 3 Jun 2021 17:10:18 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jun 03, 2021 at 02:01:46PM -0600, Alex Williamson wrote:
> 
> > > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > > >    domains.
> > > > > 
> > > > >    This doesn't actually matter. If you mix them together then kvm
> > > > >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > > >    anywhere in this VM.
> > > > > 
> > > > >    Thus if two IOMMUs are joined together into a single /dev/ioasid
> > > > >    then we can just make them both pretend to be
> > > > >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.    
> > > > 
> > > > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > > available based on the per domain support available.  That gives us the
> > > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > > because they used to have a device attached where the domain required
> > > > it and we can't atomically remap with new flags to perform the same as
> > > > a VM that never had that device attached in the first place.    
> > > 
> > > I think we are saying the same thing..  
> > 
> > Hrm?  I think I'm saying the opposite of your "both not set
> > IOMMU_CACHE".  IOMMU_CACHE is the mapping flag that enables
> > DMA_PTE_SNP.  Maybe you're using IOMMU_CACHE as the state reported to
> > KVM?  
> 
> I'm saying if we enable wbinvd in the guest then no IOASIDs used by
> that guest need to set DMA_PTE_SNP.

Yes

> If we disable wbinvd in the guest
> then all IOASIDs must enforce DMA_PTE_SNP (or we otherwise guarantee
> no-snoop is not possible).

Yes, but we can't get from one of these to the other atomically wrt
the device DMA.

> This is not what VFIO does today, but it is a reasonable choice.
> 
> Based on that observation we can say as soon as the user wants to use
> an IOMMU that does not support DMA_PTE_SNP in the guest we can still
> share the IO page table with IOMMUs that do support DMA_PTE_SNP.

If your goal is to prioritize IO page table sharing, sure.  But because
we cannot atomically transition from one to the other, each device is
stuck with the pages tables it has, so the history of the VM becomes a
factor in the performance characteristics.

For example if device {A} is backed by an IOMMU capable of blocking
no-snoop and device {B} is backed by an IOMMU which cannot block
no-snoop, then booting VM1 with {A,B} and later removing device {B}
would result in ongoing wbinvd emulation versus a VM2 only booted with
{A}.

Type1 would use separate IO page tables (domains/ioasids) for these such
that VM1 and VM2 have the same characteristics at the end.

Does this become user defined policy in the IOASID model?  There's
quite a mess of exposing sufficient GET_INFO for an IOASID for the user
to know such properties of the IOMMU, plus maybe we need mapping flags
equivalent to IOMMU_CACHE exposed to the user, preventing sharing an
IOASID that could generate IOMMU faults, etc.

> > > It doesn't solve the problem to connect kvm to AP and kvmgt though  
> > 
> > It does not, we'll probably need a vfio ioctl to gratuitously announce
> > the KVM fd to each device.  I think some devices might currently fail
> > their open callback if that linkage isn't already available though, so
> > it's not clear when that should happen, ie. it can't currently be a
> > VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> > proposal requires some availability of the vfio device fd without any
> > setup, so presumably that won't yet call the driver open callback.
> > Maybe that's part of the attach phase now... I'm not sure, it's not
> > clear when the vfio device uAPI starts being available in the process
> > of setting up the ioasid.  Thanks,  
> 
> At a certain point we maybe just have to stick to backward compat, I
> think. Though it is useful to think about green field alternates to
> try to guide the backward compat design..

I think more to drive the replacement design; if we can't figure out
how to do something other than backwards compatibility trickery in the
kernel, it's probably going to bite us.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 13:09                           ` Jason Gunthorpe
@ 2021-06-04  1:11                             ` Jason Wang
  2021-06-04 11:58                               ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-04  1:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse


On 2021/6/3 9:09 PM, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 10:52:51AM +0800, Jason Wang wrote:
>
>> Basically, we don't want to bother with a pseudo KVM device like what VFIO
>> did. So for simplicity, we rule out the IOMMU that can't enforce coherency
>> in vhost-vDPA if the parent purely depends on the platform IOMMU:
> VDPA HW cannot issue no-snoop TLPs in the first place.


Note that virtio/vDPA is not necessarily a PCI device.


>
> virtio does not define a protocol to discover such a functionality,


Actually we had:

VIRTIO_F_ACCESS_PLATFORM(33)
This feature indicates that the device can be used on a platform where 
device access to data in memory is limited and/or translated. E.g. this 
is the case if the device can be located behind an IOMMU that translates 
bus addresses from the device into physical addresses in memory, if the 
device can be limited to only access certain memory addresses or if 
special commands such as a cache flush can be needed to synchronise data 
in memory with the device.


> nor do any virtio drivers implement the required platform specific
> cache flushing to make no-snoop TLPs work.


I don't get why virtio drivers need to do that. I think the DMA API should
hide those arch/platform specific details from us.


>
> It is fundamentally part of the virtio HW PCI API that a device vendor
> cannot alter.


The spec doesn't forbid this, and it just leaves the detection and action
to the driver in a platform-specific way.

Thanks


>
> Basically since we already know that the virtio kernel drivers do not
> call the cache flush instruction we don't need the weird KVM logic to
> turn it on at all.
>
> Enforcing no-snoop at the IOMMU here is redundant/confusing.
>
> Jason
>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 18:19             ` Jacob Pan
@ 2021-06-04  1:30               ` Jason Wang
  2021-06-04 16:22                 ` Jacob Pan
  2021-06-04  2:03               ` Shenming Lu
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-04  1:30 UTC (permalink / raw)
  To: Jacob Pan, Shenming Lu
  Cc: Jason Gunthorpe, Lu Baolu, Tian, Kevin, LKML, Joerg Roedel,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang


On 2021/6/4 2:19 AM, Jacob Pan wrote:
> Hi Shenming,
>
> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <lushenming@huawei.com>
> wrote:
>
>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>    
>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>> will then register itself to iommu core to listen and route the per-
>>>> device I/O page faults.
>>> I'm still confused why drivers need fault handlers at all?
>> Essentially it is the userspace that needs the fault handlers,
>> one case is to deliver the faults to the vIOMMU, and another
>> case is to enable IOPF on the GPA address space for on-demand
>> paging, it seems that both could be specified in/through the
>> IOASID_ALLOC ioctl?
>>
> I would think IOASID_BIND_PGTABLE is where fault handler should be
> registered. There wouldn't be any IO page fault without the binding anyway.
>
> I also don't understand why device drivers should register the fault
> handler; the fault is detected by the pIOMMU and injected into the vIOMMU. So
> I think it should be the IOASID itself that registers the handler.


As discussed in another thread.

I think the reason is that ATS doesn't forbid the #PF from being reported in
a device-specific way.

Thanks


>
>> Thanks,
>> Shenming
>>
>
> Thanks,
>
> Jacob
>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 18:19             ` Jacob Pan
  2021-06-04  1:30               ` Jason Wang
@ 2021-06-04  2:03               ` Shenming Lu
  2021-06-07 12:19                 ` Liu, Yi L
  1 sibling, 1 reply; 258+ messages in thread
From: Shenming Lu @ 2021-06-04  2:03 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, Lu Baolu, Tian, Kevin, LKML, Joerg Roedel,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jean-Philippe Brucker, David Gibson,
	Kirti Wankhede, Robin Murphy, Zenghui Yu, wanghaibin.wang

On 2021/6/4 2:19, Jacob Pan wrote:
> Hi Shenming,
> 
> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <lushenming@huawei.com>
> wrote:
> 
>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>   
>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>> will then register itself to iommu core to listen and route the per-
>>>> device I/O page faults.   
>>>
>>> I'm still confused why drivers need fault handlers at all?  
>>
>> Essentially it is the userspace that needs the fault handlers,
>> one case is to deliver the faults to the vIOMMU, and another
>> case is to enable IOPF on the GPA address space for on-demand
>> paging, it seems that both could be specified in/through the
>> IOASID_ALLOC ioctl?
>>
> I would think IOASID_BIND_PGTABLE is where fault handler should be
> registered. There wouldn't be any IO page fault without the binding anyway.

Yeah, I also proposed this before, registering the handler in the BIND_PGTABLE
ioctl does make sense for the guest page faults. :-)

But how about the page faults from the GPA address space (its page table is
mapped through the MAP_DMA ioctl)? From your point of view, it seems that we
should register the handler for the GPA address space in the (first) MAP_DMA
ioctl.

> 
> I also don't understand why device drivers should register the fault
> handler; the fault is detected by the pIOMMU and injected into the vIOMMU. So
> I think it should be the IOASID itself that registers the handler.

Yeah, and it can also be said that the provider of the page table registers the
handler (Baolu).

Thanks,
Shenming


^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 11:47                 ` Jason Gunthorpe
@ 2021-06-04  2:15                   ` Tian, Kevin
  0 siblings, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  2:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Gibson, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, David Woodhouse, Jason Wang, LKML,
	Kirti Wankhede, Alex Williamson (alex.williamson@redhat.com),
	iommu, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 7:47 PM
> 
> On Thu, Jun 03, 2021 at 06:49:20AM +0000, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Thursday, June 3, 2021 1:09 PM
> > [...]
> > > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > > cache.
> > > > >
> > > > > The collaposed shadow page table is really just a cache.
> > > > >
> > > >
> > > > OK. One additional thing is that we may need a 'caching_mode'
> > > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > > required when changing non-present to present. For hardware
> > > > nesting it's not reported as the hardware IOMMU will walk the
> > > > guest page table in cases of iotlb miss. For software nesting
> > > > caching_mode is reported so the user must issue invalidation
> > > > upon any change in guest page table so the kernel can update
> > > > the shadow page table timely.
> > >
> > > For the fist cut, I'd have the API assume that invalidates are
> > > *always* required.  Some bypass to avoid them in cases where they're
> > > not needed can be an additional extension.
> > >
> >
> > Isn't the typical TLB semantic that non-present entries are not
> > cached, thus invalidation is not required when going from non-present
> > to present? It's true for both the CPU TLB and the IOMMU TLB. In reality
> > I feel there are more usages built on hardware nesting than software
> > nesting, thus making the default follow hardware TLB behavior makes
> > more sense...
> 
> From a modelling perspective it makes sense to have the most general
> be the default and if an implementation can elide certain steps then
> describing those as additional behaviors on the universal baseline is
> cleaner
> 
> I'm surprised to hear your remarks about the not-present case though;
> how does the vIOMMU emulation work if there are no hypervisor
> invalidation traps for not-present/present transitions?
> 

Such invalidation traps matter only for shadow I/O page tables (software
nesting). For hardware nesting no trap is required for the non-present/
present transition since the physical IOTLB doesn't cache non-present
entries. The IOMMU will walk the guest I/O page table in case of an IOTLB
miss.

The vIOMMU should be constructed according to whether software
or hardware nesting is used. For Intel (and AMD iirc), a caching_mode 
capability decides whether the guest needs to do invalidation for
non-present/present transition. Such vIOMMU should clear this bit
for hardware nesting or set it for software nesting. ARM SMMU doesn't
have this capability. Therefore their vSMMU can only work with a
hardware nested IOASID.
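
The user-visible knob could then just mirror this, e.g. (hypothetical
flag name, reported per IOASID or per page table format):

	/* if set, the user must invalidate even for non-present -> present
	 * changes in the bound I/O page table (the software nesting case) */
	#define IOASID_INFO_F_CACHING_MODE	(1 << 0)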

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 12:11         ` Jason Gunthorpe
@ 2021-06-04  6:08           ` Tian, Kevin
  2021-06-04 12:33             ` Jason Gunthorpe
  2021-06-08  6:13           ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  6:08 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 8:11 PM
> 
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > > 	/* Bind guest I/O page table  */
> > > > > > 	bind_data = {
> > > > > > 		.ioasid	= gva_ioasid;
> > > > > > 		.addr	= gva_pgtable1;
> > > > > > 		// and format information
> > > > > > 	};
> > > > > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > > >
> > > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > > there any reason to split these things? The only advantage to the
> > > > > split is the device is known, but the device shouldn't impact
> > > > > anything..
> > > >
> > > > I'm pretty sure the device(s) could matter, although they probably
> > > > won't usually.
> > >
> > > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > > devices first. This prevents wildly incompatible devices from being
> > > joined together, and allows some "get info" to report the capability
> > > union of all devices if we want to do that.
> >
> > Right.. but I've not been convinced that having a /dev/iommu fd
> > instance be the boundary for these types of things actually makes
> > sense.  For example if we were doing the preregistration thing
> > (whether by child ASes or otherwise) then that still makes sense
> > across wildly different devices, but we couldn't share that layer if
> > we have to open different instances for each of them.
> 
> It is something that still seems up in the air.. What seems clear for
> /dev/iommu is that it
>  - holds a bunch of IOASIDs organized into a tree
>  - holds a bunch of connected devices
>  - holds a pinned memory cache
> 
> One thing it must do is enforce IOMMU group security. A device cannot
> be attached to an IOASID unless all devices in its IOMMU group are
> part of the same /dev/iommu FD.
> 
> The big open question is what parameters govern allowing devices to
> connect to the /dev/iommu:
>  - all devices can connect and we model the differences inside the API
>    somehow.

I prefer this option if there is no significant blocker ahead.

>  - Only sufficiently "similar" devices can be connected
>  - The FD's capability is the minimum of all the connected devices
> 
> There are some practical problems here, when an IOASID is created the
> kernel does need to allocate a page table for it, and that has to be
> in some definite format.
> 
> It may be that we had a false start thinking the FD container should
> be limited. Perhaps creating an IOASID should pass in a list
> of the "device labels" that the IOASID will be used with and that can
> guide the kernel what to do?

In the Qemu case the problem is that it doesn't know the list of devices
that will be attached to an IOASID when it's created. This is guest-
side knowledge which is conveyed one device at a time to Qemu
through the vIOMMU.

I feel it's fair to say that before creating an IOASID the user should
already have checked the format information of the device which is
intended to be attached right after, and then specify a format compatible
with that device when creating the IOASID. There is no format check when
the IOASID is created, since its I/O page table is not installed to the
IOMMU yet. Later, when the intended device is attached to this IOASID,
the format is verified and the attach request fails if it is
incompatible.
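
i.e. roughly this ordering from the user's point of view (ioctl names as
used elsewhere in this thread, details still hypothetical):

	/* query supported formats for the device to be attached */
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);
	alloc.format = info.format;	/* pick something the device supports */
	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);	/* nothing installed yet */
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at);	/* compatibility verified here, may fail */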

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 12:46       ` Jason Gunthorpe
@ 2021-06-04  6:27         ` Tian, Kevin
  0 siblings, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  6:27 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 8:46 PM
> 
> On Thu, Jun 03, 2021 at 04:26:08PM +1000, David Gibson wrote:
> 
> > > There are global properties in the /dev/iommu FD, like what devices
> > > are part of it, that are important for group security operations. This
> > > becomes confused if it is split to many FDs.
> >
> > I'm still not seeing those.  I'm really not seeing any well-defined
> > meaning to devices being attached to the fd, but not to a particular
> > IOAS.
> 
> Kevin can you add a section on how group security would have to work
> to the RFC? This is the idea you can't attach a device to an IOASID
> unless all devices in the IOMMU group are joined to the /dev/iommu FD.
> 
> The basic statement is that userspace must present the entire group
> membership to /dev/iommu to prove that it has the security right to
> manipulate their DMA translation.
> 
> It is the device centric analog to what the group FD is doing for
> security.
> 

Yes, will do.
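
For the record, roughly the check being described, in pseudo-code (the
ioasid_ctx helper below is made up; only the iommu_group accessors are
real kernel API):

	/* attaching a device to any IOASID is refused unless every device in
	 * its iommu_group has been bound to the same /dev/iommu FD */
	bool ioasid_group_check(struct ioasid_ctx *ictx, struct device *dev)
	{
		struct iommu_group *grp = iommu_group_get(dev);
		bool ok;

		if (!grp)
			return false;
		ok = ioasid_ctx_has_all_group_devices(ictx, grp); /* hypothetical */
		iommu_group_put(grp);
		return ok;
	}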

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 13:05     ` Jason Gunthorpe
@ 2021-06-04  6:37       ` Tian, Kevin
  2021-06-04 12:09         ` Jason Gunthorpe
  2021-06-15  8:59       ` Tian, Kevin
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  6:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 9:05 PM
> 
> > >
> > > 3) Device accepts any PASIDs from the guest. No
> > >    vPASID/pPASID translation is possible. (classic vfio_pci)
> > > 4) Device accepts any PASID from the guest and has an
> > >    internal vPASID/pPASID translation (enhanced vfio_pci)
> >
> > what is enhanced vfio_pci? In my writing this is for mdev
> > which doesn't support ENQCMD
> 
> This is a vfio_pci that mediates some element of the device interface
> to communicate the vPASID/pPASID table to the device, using Max's
> series for vfio_pci drivers to inject itself into VFIO.
> 
> For instance a device might send a message through the PF that the VF
> has a certain vPASID/pPASID translation table. This would be useful
> for devices that cannot use ENQCMD but still want to support migration
> and thus need vPASID.

I still don't quite get it. If it's a PCI device, why is PASID translation required?
Just delegate the per-RID PASID space to the user as type-3; then migrating the
vPASID space is straightforward.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 12:40                               ` Jason Gunthorpe
  2021-06-03 20:41                                 ` Alex Williamson
@ 2021-06-04  7:33                                 ` Tian, Kevin
  1 sibling, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  7:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 8:41 PM
> 
> > When discussing I/O page fault support in another thread, the consensus
> > is that an device handle will be registered (by user) or allocated (return
> > to user) in /dev/ioasid when binding the device to ioasid fd. From this
> > angle we can register {ioasid_fd, device_handle} to KVM and then call
> > something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
> 
> It is not device specific, it is driver specific
> 
> As I said before, the question is if the IOASID itself can enforce
> snoop, or not. AND if the device will issue no-snoop or not.

Sure. My earlier comment was based on the assumption that all IOASIDs
attached to a device should inherit the same snoop/no-snoop fact. But it
looks like it doesn't prevent a device driver from setting PTE_SNP only for
selected I/O page tables, according to whether isoch agents are involved.

A user space driver could figure out per-IOASID requirements itself.

A guest device driver can indirectly convey this information through 
vIOMMU.

Registering {IOASID_FD, IOASID} to KVM has another merit, as we also
need it to update CPU PASID mapping for ENQCMD. We can define
one interface for both requirements. 😊
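
e.g. a single registration could look like (hypothetical, not an existing
KVM ioctl):

	struct kvm_ioasid_reg {
		__s32	ioasid_fd;
		__u32	ioasid;
	};
	/* used both to derive the wbinvd/no-snoop state and to set up the
	 * CPU PASID translation needed for ENQCMD */
	ioctl(kvm_fd, KVM_REGISTER_IOASID, &reg);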

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02  1:25       ` Tian, Kevin
  2021-06-02 23:27         ` Jason Gunthorpe
@ 2021-06-04  8:17         ` Jean-Philippe Brucker
  2021-06-04  8:43           ` Tian, Kevin
  1 sibling, 1 reply; 258+ messages in thread
From: Jean-Philippe Brucker @ 2021-06-04  8:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

On Wed, Jun 02, 2021 at 01:25:00AM +0000, Tian, Kevin wrote:
> > > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > > specify a device label. This label will be recorded in /dev/iommu to
> > > serve per-device invalidation request from and report per-device
> > > fault data to the user.
> > 
> > I wonder which of the user providing a 64 bit cookie or the kernel
> > returning a small IDA is the best choice here? Both have merits
> > depending on what qemu needs..
> 
> Yes, either way can work. I don't have a strong preference. Jean?

I don't see an issue with either solution, maybe it will show up while
prototyping. First one uses IDs that do mean something for someone, and
userspace may inject faults slightly faster since it doesn't need an
ID->vRID lookup, so that's my preference.

> > > In addition, vPASID (if provided by user) will
> > > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > > is conducted properly. e.g. invalidation request from user carries
> > > a vPASID which must be converted into pPASID before calling iommu
> > > driver. Vice versa for raw fault data which carries pPASID while the
> > > user expects a vPASID.
> > 
> > I don't think the PASID should be returned at all. It should return
> > the IOASID number in the FD and/or a u64 cookie associated with that
> > IOASID. Userspace should figure out what the IOASID & device
> > combination means.
> 
> This is true for Intel. But what about ARM which has only one IOASID
> (pasid table) per device to represent all guest I/O page tables?

In that case vPASID = pPASID though. The vPASID allocated by the guest is
the same from the vIOMMU inval to the pIOMMU inval. I don't think host
kernel or userspace need to alter it.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 21:44                                   ` Alex Williamson
@ 2021-06-04  8:38                                     ` Tian, Kevin
  2021-06-04 12:28                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  8:38 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, June 4, 2021 5:44 AM
> 
> > Based on that observation we can say as soon as the user wants to use
> > an IOMMU that does not support DMA_PTE_SNP in the guest we can still
> > share the IO page table with IOMMUs that do support DMA_PTE_SNP.

Page table sharing between incompatible IOMMUs is not a critical
thing. I prefer disallowing sharing in such cases as the starting point,
i.e. the user needs to create separate IOASIDs for such devices.

> 
> If your goal is to prioritize IO page table sharing, sure.  But because
> we cannot atomically transition from one to the other, each device is
> stuck with the page tables it has, so the history of the VM becomes a
> factor in the performance characteristics.
> 
> For example if device {A} is backed by an IOMMU capable of blocking
> no-snoop and device {B} is backed by an IOMMU which cannot block
> no-snoop, then booting VM1 with {A,B} and later removing device {B}
> would result in ongoing wbinvd emulation versus a VM2 only booted with
> {A}.
> 
> Type1 would use separate IO page tables (domains/ioasids) for these such
> that VM1 and VM2 have the same characteristics at the end.
> 
> Does this become user defined policy in the IOASID model?  There's
> quite a mess of exposing sufficient GET_INFO for an IOASID for the user
> to know such properties of the IOMMU, plus maybe we need mapping flags
> equivalent to IOMMU_CACHE exposed to the user, preventing sharing an
> IOASID that could generate IOMMU faults, etc.

IOMMU_CACHE is a fixed attribute given an IOMMU. So it's better to
convey this info to userspace via GET_INFO for a device_label, before 
creating any IOASID. But overall I agree that careful thinking is required
about how to organize this info reporting (per-fd, per-device, per-ioasid)
to userspace.

> 
> > > > It doesn't solve the problem to connect kvm to AP and kvmgt though
> > >
> > > It does not, we'll probably need a vfio ioctl to gratuitously announce
> > > the KVM fd to each device.  I think some devices might currently fail
> > > their open callback if that linkage isn't already available though, so
> > > it's not clear when that should happen, ie. it can't currently be a
> > > VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> > > proposal requires some availability of the vfio device fd without any
> > > setup, so presumably that won't yet call the driver open callback.
> > > Maybe that's part of the attach phase now... I'm not sure, it's not
> > > clear when the vfio device uAPI starts being available in the process
> > > of setting up the ioasid.  Thanks,
> >
> > At a certain point we maybe just have to stick to backward compat, I
> > think. Though it is useful to think about green field alternates to
> > try to guide the backward compat design..
> 
> I think more to drive the replacement design; if we can't figure out
> how to do something other than backwards compatibility trickery in the
> kernel, it's probably going to bite us.  Thanks,
> 

I'm a bit lost on the desired flow in your minds. Here is one flow based
on my understanding of this discussion. Please comment whether it
matches your thinking:

0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;

1) Qemu binds dev1 to ioasid_fd;

2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry the
     IOMMU_CACHE info, i.e. whether the underlying IOMMU can enforce snoop;

3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
    needs to figure out whether dev1 wants to do no-snoop. This might
    be based on a fixed vendor/class list or specified by the user;

4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
     flag is specified to decide the page table format, which is supposed
     to match dev1;

5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this 
     point, specify snoop/no-snoop again. If not supported by related 
     iommu or different from what gpa_ioasid has, attach fails.

6) Call KVM to update the snoop requirement via KVM_UPDATE_IOASID_FD.
    This triggers ioasidfd_for_each_ioasid();

Later when dev2 is attached to gpa_ioasid, the same flow is followed. This
implies that KVM_UPDATE_IOASID_FD is called only when a new IOASID is
created or an existing IOASID is destroyed, because all devices under an 
IOASID should have the same snoop requirement.
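
To make this concrete, below is a rough userspace sketch of the same
sequence. All ioctl names, flags and struct layouts are illustrative
only (IOASID_ENFORCE_SNOOP, struct ioasid_alloc, VFIO_BIND_IOASID_FD,
etc. are invented for the example); nothing here is meant as final uAPI:

    /* vm_fd and dev1_fd are assumed to be set up already via
     * /dev/kvm and VFIO respectively */
    int ioasid_fd = open("/dev/ioasid", O_RDWR);

    /* 0) register the ioasid fd with KVM */
    ioctl(vm_fd, KVM_ADD_IOASID_FD, &ioasid_fd);

    /* 1) bind dev1 to the ioasid fd */
    ioctl(dev1_fd, VFIO_BIND_IOASID_FD, &ioasid_fd);

    /* 2) query per-device info, including snoop enforcement */
    struct ioasid_device_info info = { .argsz = sizeof(info) };
    ioctl(ioasid_fd, IOASID_GET_DEV_INFO, &info);

    /* 3)+4) allocate the GPA IOASID with the chosen snoop policy */
    struct ioasid_alloc alloc = {
            .argsz = sizeof(alloc),
            .flags = want_snoop ? IOASID_ENFORCE_SNOOP : 0,
    };
    int gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);

    /* 5) attach dev1; fails if the IOMMU can't honor the policy */
    ioctl(dev1_fd, VFIO_ATTACH_IOASID, &gpa_ioasid);

    /* 6) tell KVM to re-evaluate the snoop requirement */
    ioctl(vm_fd, KVM_UPDATE_IOASID_FD, &ioasid_fd);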

Thanks
Kevin
     

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  8:17         ` Jean-Philippe Brucker
@ 2021-06-04  8:43           ` Tian, Kevin
  0 siblings, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  8:43 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, David Gibson, Kirti Wankhede,
	Robin Murphy

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Friday, June 4, 2021 4:18 PM
> 
> On Wed, Jun 02, 2021 at 01:25:00AM +0000, Tian, Kevin wrote:
> > > > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > > > specify a device label. This label will be recorded in /dev/iommu to
> > > > serve per-device invalidation request from and report per-device
> > > > fault data to the user.
> > >
> > > I wonder which of the user providing a 64 bit cookie or the kernel
> > > returning a small IDA is the best choice here? Both have merits
> > > depending on what qemu needs..
> >
> > Yes, either way can work. I don't have a strong preference. Jean?
> 
> I don't see an issue with either solution, maybe it will show up while
> prototyping. First one uses IDs that do mean something for someone, and
> userspace may inject faults slightly faster since it doesn't need an
> ID->vRID lookup, so that's my preference.

ok, will go for the first option in v2.

> 
> > > > In addition, vPASID (if provided by user) will
> > > > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > > > is conducted properly. e.g. invalidation request from user carries
> > > > a vPASID which must be converted into pPASID before calling iommu
> > > > driver. Vice versa for raw fault data which carries pPASID while the
> > > > user expects a vPASID.
> > >
> > > I don't think the PASID should be returned at all. It should return
> > > the IOASID number in the FD and/or a u64 cookie associated with that
> > > IOASID. Userspace should figure out what the IOASID & device
> > > combination means.
> >
> > This is true for Intel. But what about ARM which has only one IOASID
> > (pasid table) per device to represent all guest I/O page tables?
> 
> In that case vPASID = pPASID though. The vPASID allocated by the guest is
> the same from the vIOMMU inval to the pIOMMU inval. I don't think host
> kernel or userspace need to alter it.
> 

Yes. So responding to Jason's earlier comment, we do need to return the
PASID (although no conversion is required) to userspace in this
case. 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 20:41                                 ` Alex Williamson
@ 2021-06-04  9:19                                   ` Tian, Kevin
  2021-06-04 15:37                                     ` Alex Williamson
  2021-06-04 12:13                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04  9:19 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, June 4, 2021 4:42 AM
> 
> > 'qemu --allow-no-snoop' makes more sense to me
> 
> I'd be tempted to attach it to the -device vfio-pci option, it's
> specific drivers for specific devices that are going to want this and
> those devices may not be permanently attached to the VM.  But I see in
> the other thread you're trying to optimize IOMMU page table sharing.
> 
> There's a usability question in either case though and I'm not sure how
> to get around it other than QEMU or the kernel knowing a list of
> devices (explicit IDs or vendor+class) to select per device defaults.
> 

"-device vfio-pci" is a per-device option, which implies that the
no-snoop choice is given to the admin then no need to maintain 
a fixed device list in Qemu?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03  5:45       ` David Gibson
  2021-06-03 12:11         ` Jason Gunthorpe
@ 2021-06-04 10:24         ` Jean-Philippe Brucker
  2021-06-04 12:05           ` Jason Gunthorpe
  2021-06-08  6:31           ` David Gibson
  1 sibling, 2 replies; 258+ messages in thread
From: Jean-Philippe Brucker @ 2021-06-04 10:24 UTC (permalink / raw)
  To: David Gibson
  Cc: Jason Gunthorpe, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Kirti Wankhede, Robin Murphy

On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > > But it would certainly be possible for a system to have two
> > > different host bridges with two different IOMMUs with different
> > > pagetable formats.  Until you know which devices (and therefore
> > > which host bridge) you're talking about, you don't know what formats
> > > of pagetable to accept.  And if you have devices from *both* bridges
> > > you can't bind a page table at all - you could theoretically support
> > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > in both formats, but it would be pretty reasonable not to support
> > > that.
> > 
> > The basic process for a user space owned pgtable mode would be:
> > 
> >  1) qemu has to figure out what format of pgtable to use
> > 
> >     Presumably it uses query functions using the device label.
> 
> No... in the qemu case it would always select the page table format
> that it needs to present to the guest.  That's part of the
> guest-visible platform that's selected by qemu's configuration.
> 
> There's no negotiation here: either the kernel can supply what qemu
> needs to pass to the guest, or it can't.  If it can't qemu, will have
> to either emulate in SW (if possible, probably using a kernel-managed
> IOASID to back it) or fail outright.
> 
> >     The
> >     kernel code should look at the entire device path through all the
> >     IOMMU HW to determine what is possible.
> > 
> >     Or it already knows because the VM's vIOMMU is running in some
> >     fixed page table format, or the VM's vIOMMU already told it, or
> >     something.
> 
> Again, I think you have the order a bit backwards.  The user selects
> the capabilities that the vIOMMU will present to the guest as part of
> the qemu configuration.  Qemu then requests that of the host kernel,
> and either the host kernel supplies it, qemu emulates it in SW, or
> qemu fails to start.

Hm, how fine a capability are we talking about?  If it's just "give me
VT-d capabilities" or "give me Arm capabilities" that would work, but
probably isn't useful. Anything finer will be awkward because userspace
will have to try combinations of capabilities to see what sticks, and
supporting new hardware will drop compatibility for older ones.

For example, depending on whether the hardware IOMMU is SMMUv2 or SMMUv3,
the capabilities offered to the guest change completely (some v2
implementations support nesting page tables, but never PASID nor PRI
unlike v3.) The same vIOMMU could support either, presenting different
capabilities to the guest, even multiple page table formats if we wanted
to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
needs to know early on what the hardware is precisely. Then some new page
table format shows up and, although the vIOMMU can support that in
addition to older ones, QEMU will have to pick a single one that it
assumes the guest knows how to drive?

I think once it binds a device to an IOASID fd, QEMU will want to probe
what hardware features are available before going further with the vIOMMU
setup (is there PASID, PRI, which page table formats are supported,
address size, page granule, etc). Obtaining precise information about the
hardware would be less awkward than trying different configurations until
one succeeds. Binding an additional device would then fail if its pIOMMU
doesn't support exactly the features supported for the first device,
because we don't know which ones the guest will choose. QEMU will have to
open a new IOASID fd for that device.
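
To illustrate the sort of report I have in mind (the struct and field
names below are invented for the example, not a proposed layout):

    /* hypothetical per-device hardware capability report */
    struct ioasid_device_hw_info {
            __u32   argsz;
            __u32   flags;            /* e.g. PASID, PRI, ATS supported */
            __u32   pgtable_formats;  /* bitmap of supported formats */
            __u32   pasid_bits;       /* width of the PASID space */
            __u64   input_addr_bits;  /* input address size */
            __u64   pgsize_bitmap;    /* supported page granules */
    };

QEMU would compare something like this against what its vIOMMU model
needs before committing to nested page tables for that device.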

Thanks,
Jean

> 
> Guest visible properties of the platform never (or *should* never)
> depend implicitly on host capabilities - it's impossible to sanely
> support migration in such an environment.
> 
> >  2) qemu creates an IOASID and based on #1 and says 'I want this format'
> 
> Right.
> 
> >  3) qemu binds the IOASID to the device. 
> > 
> >     If qmeu gets it wrong then it just fails.
> 
> Right, though it may be fall back to (partial) software emulation.  In
> practice that would mean using a kernel-managed IOASID and walking the
> guest IO pagetables itself to mirror them into the host kernel.
> 
> >  4) For the next device qemu would have to figure out if it can re-use
> >     an existing IOASID based on the required proeprties.
> 
> Nope.  Again, what devices share an IO address space is a guest
> visible part of the platform.  If the host kernel can't supply that,
> then qemu must not start (or fail the hotplug if the new device is
> being hotplugged).

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 17:24   ` Jason Gunthorpe
@ 2021-06-04 10:44     ` Enrico Weigelt, metux IT consult
  2021-06-04 12:30       ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-04 10:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On 02.06.21 19:24, Jason Gunthorpe wrote:

Hi,

 >> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
 >> to other APIs / devices. Why can't the fd be acquired by the
 >> consumer APIs (eg. kvm, vfio, etc) ?
 >
 > /dev/ioasid would be similar to /dev/vfio, and everything already
 > deals with exposing /dev/vfio and /dev/vfio/N together
 >
 > I don't see it as a problem, just more work.

One of the problems I'm seeing is in container environments: when
passing in a vfio device, we now also need to pass in /dev/ioasid,
thus increasing the complexity of container setup (or orchestration).

And in such scenarios you usually want to pass in one specific device,
not all of the same class, and usually orchestration shall pick the
next free one.

Can we make sure that a process having full access to /dev/ioasid,
while only supposed to have access to specific consumer devices, can't do
any harm (eg. influencing other containers that might use a different
consumer device)?

Note that we don't have device namespaces yet (device isolation still
has to be done w/ complicated bpf magic). I'm already working on that,
but even "simple" things like loopdev allocation turn out to be not
entirely easy.

 > Having FDs spawn other FDs is pretty ugly, it defeats the "everything
 > is a file" model of UNIX.

Unfortunately, this is already defeated in many other places :(
(I'd even claim that ioctls already break it :p)

It seems your approach also breaks this, since we now need to open two
files in order to talk to one device.

By the way: my idea does keep the "everything's a file" concept - we
just have a file that allows opening "sub-files". Well, it would be
better if devices could also have directory semantics.


--mtx

---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  1:11                             ` Jason Wang
@ 2021-06-04 11:58                               ` Jason Gunthorpe
  2021-06-07  3:18                                 ` Jason Wang
  2021-06-30  7:05                                 ` Christoph Hellwig
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 11:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse

On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
> > nor do any virtio drivers implement the required platform specific
> > cache flushing to make no-snoop TLPs work.
> 
> I don't get why virtio drivers needs to do that. I think DMA API should hide
> those arch/platform specific stuffs from us.

It is not arch/platform stuff. If the device uses no-snoop then a
very platform specific recovery is required in the device driver.

It is not part of the normal DMA API, it is side APIs like
flush_agp_cache() or wbinvd() that are used by GPU drivers only.

If drivers/virtio doesn't explicitly call these things it doesn't
support no-snoop - hence no VDPA device can ever use no-snoop.

Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
nothing to do with no-snoop.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 10:24         ` Jean-Philippe Brucker
@ 2021-06-04 12:05           ` Jason Gunthorpe
  2021-06-04 17:27             ` Jacob Pan
  2021-06-08  6:31           ` David Gibson
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 12:05 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: David Gibson, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Kirti Wankhede, Robin Murphy

On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:

> I think once it binds a device to an IOASID fd, QEMU will want to probe
> what hardware features are available before going further with the vIOMMU
> setup (is there PASID, PRI, which page table formats are supported,

I think David's point was that qemu should be told what vIOMMU it is
emulating exactly (right down to what features it has) and then
the goal is simply to match what the vIOMMU needs with direct HW
support via /dev/ioasid and fall back to SW emulation when not
possible.

If qemu wants to have some auto-configuration: 'pass host IOMMU
capabilities' similar to the CPU flags then qemu should probe the
/dev/ioasid - and maybe we should just return some highly rolled up
"this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
doing this.
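
As a purely illustrative example of such a rolled-up query result
(nothing here is a proposed layout, the names are made up):

    /* hypothetical "what IOMMU HW is this" identifier */
    struct ioasid_iommu_hw_id {
            __u32   vendor;     /* e.g. IOMMU_HW_INTEL_VTD, IOMMU_HW_ARM_SMMUV3 */
            __u32   version;    /* implementation revision */
    };

qemu could match that against the table of vIOMMU models it knows how
to provide and SW-emulate whatever the HW ID doesn't cover.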

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  6:37       ` Tian, Kevin
@ 2021-06-04 12:09         ` Jason Gunthorpe
  2021-06-04 23:10           ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 12:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 06:37:26AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 9:05 PM
> > 
> > > >
> > > > 3) Device accepts any PASIDs from the guest. No
> > > >    vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > 4) Device accepts any PASID from the guest and has an
> > > >    internal vPASID/pPASID translation (enhanced vfio_pci)
> > >
> > > what is enhanced vfio_pci? In my writing this is for mdev
> > > which doesn't support ENQCMD
> > 
> > This is a vfio_pci that mediates some element of the device interface
> > to communicate the vPASID/pPASID table to the device, using Max's
> > series for vfio_pci drivers to inject itself into VFIO.
> > 
> > For instance a device might send a message through the PF that the VF
> > has a certain vPASID/pPASID translation table. This would be useful
> > for devices that cannot use ENQCMD but still want to support migration
> > and thus need vPASID.
> 
> I still don't quite get. If it's a PCI device why is PASID translation required?
> Just delegate the per-RID PASID space to user as type-3 then migrating the 
> vPASID space is just straightforward.

This is only possible if we get rid of the global pPASID allocation
(honestly that is my preference as it makes the HW a lot simpler)

Without vPASID the migration would need pPASIDs on the RID that are
guaranteed free.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-03 20:41                                 ` Alex Williamson
  2021-06-04  9:19                                   ` Tian, Kevin
@ 2021-06-04 12:13                                   ` Jason Gunthorpe
  2021-06-04 21:45                                     ` Alex Williamson
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 12:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote:

> Could you clarify "vfio_driver"?  

This is the thing providing the vfio_device_ops function pointers.

So vfio-pci can't know anything about this (although your no-snoop
control probing idea makes sense to me)

But vfio_mlx5_pci can know

So can mdev_idxd

And kvmgt

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  8:38                                     ` Tian, Kevin
@ 2021-06-04 12:28                                       ` Jason Gunthorpe
  2021-06-04 15:26                                         ` Alex Williamson
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 12:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 08:38:26AM +0000, Tian, Kevin wrote:
> > I think more to drive the replacement design; if we can't figure out
> > how to do something other than backwards compatibility trickery in the
> > kernel, it's probably going to bite us.  Thanks,
> 
> I'm a bit lost on the desired flow in your minds. Here is one flow based
> on my understanding of this discussion. Please comment whether it
> matches your thinking:
> 
> 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
> 
> 1) Qemu binds dev1 to ioasid_fd;
> 
> 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
>      CACHE info i.e. whether underlying IOMMU can enforce snoop;
> 
> 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
>     needs to figure out whether dev1 wants to do no-snoop. This might
>     be based a fixed vendor/class list or specified by user;
> 
> 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
>      flag is specified to decide the page table format, which is supposed
>      to match dev1;

> 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this 
>      point, specify snoop/no-snoop again. If not supported by related 
>      iommu or different from what gpa_ioasid has, attach fails.

Why do we need to specify it again?

If the IOASID was created with the "block no-snoop" flag then it is
blocked in that IOASID, and that blocking sets the page table format.

The only question is if we can successfully attach a device to the
page table, or not.

The KVM interface is a bit tricky because Alex said this is partially
about security: wbinvd is only enabled if someone has an FD to a device
that can support no-snoop. 

Personally I think this got way too complicated, the KVM interface
should simply be

ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)

and let qemu sort it out based on command flags, detection, whatever.

'ioasidfd, device_label' is the security proof that Alex asked
for. This needs to be some device in the ioasidfd that declares it is
capable of no-snoop. Eg vfio_pci would always declare it is capable
of no-snoop.

No kernel callbacks, no kernel auto-sync/etc. If qemu mismatches the
IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
is just a kernel-harmless userspace bug.

Then user space can decide which of the various axes it wants to
optimize for.
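
Roughly, from qemu's point of view (the struct, the fd the ioctl is
issued on, and the policy variables are all invented for illustration):

    /* hypothetical argument carrying the "security proof" */
    struct kvm_incoherent_dma {
            __s32   ioasid_fd;
            __u32   device_label;
    };

    struct kvm_incoherent_dma arg = {
            .ioasid_fd    = ioasid_fd,
            .device_label = dev_label,
    };

    /* qemu policy: enable wbinvd emulation only when the admin asked
     * for no-snoop and the device declares it can issue no-snoop */
    if (allow_no_snoop && device_is_no_snoop_capable)
            ioctl(vm_fd, KVM_ALLOW_INCOHERENT_DMA, &arg);
    else
            ioctl(vm_fd, KVM_DISALLOW_INCOHERENT_DMA, &arg);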

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 10:44     ` Enrico Weigelt, metux IT consult
@ 2021-06-04 12:30       ` Jason Gunthorpe
  2021-06-08  1:15         ` David Gibson
  2021-06-08 10:43         ` Enrico Weigelt, metux IT consult
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 12:30 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	David Gibson, Kirti Wankhede, Robin Murphy

On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 02.06.21 19:24, Jason Gunthorpe wrote:
> 
> Hi,
> 
> >> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
> >> to other APIs / devices. Why can't the fd be acquired by the
> >> consumer APIs (eg. kvm, vfio, etc) ?
> >
> > /dev/ioasid would be similar to /dev/vfio, and everything already
> > deals with exposing /dev/vfio and /dev/vfio/N together
> >
> > I don't see it as a problem, just more work.
> 
> One of the problems I'm seeing is in container environments: when
> passing in an vfio device, we now also need to pass in /dev/ioasid,
> thus increasing the complexity in container setup (or orchestration).

Containers already needed to do this today. Container orchestration is
hard.

> And in such scenarios you usually want to pass in one specific device,
> not all of the same class, and usually orchestration shall pick the
> next free one.
> 
> Can we make sure that a process having full access to /dev/ioasid
> while only supposed to have to specific consumer devices, can't do
> any harm (eg. influencing other containers that might use a different
> consumer device) ?

Yes, /dev/ioasid shouldn't do anything unless you have a device to
connect it with. In this way it is probably safe to stuff it into
every container.

> > Having FDs spawn other FDs is pretty ugly, it defeats the "everything
> > is a file" model of UNIX.
> 
> Unfortunately, this is already defeated in many other places :(
> (I'd even claim that ioctls already break it :p)

I think you are reaching a bit :)

> It seems your approach also breaks this, since we now need to open two
> files in order to talk to one device.

It is two devices, thus two files.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  6:08           ` Tian, Kevin
@ 2021-06-04 12:33             ` Jason Gunthorpe
  2021-06-04 23:20               ` Tian, Kevin
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 12:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: David Gibson, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

On Fri, Jun 04, 2021 at 06:08:28AM +0000, Tian, Kevin wrote:

> In Qemu case the problem is that it doesn't know the list of devices
> that will be attached to an IOASID when it's created. This is a guest-
> side knowledge which is conveyed one device at a time to Qemu 
> though vIOMMU.

At least for the guest side it is a lot simpler because the vIOMMU
being emulated will define nearly everything. 

qemu will just have to ask the kernel for whatever it is the guest is
doing. If the kernel can't do it then qemu has to SW emulate.

The no-snoop block may be the only thing that is under qemu's control
because it is transparent to the guest.

This will probably become clearer as people start to define what the
get_info should return.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 12:28                                       ` Jason Gunthorpe
@ 2021-06-04 15:26                                         ` Alex Williamson
  2021-06-04 15:40                                           ` Paolo Bonzini
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-04 15:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang, Bonzini, Paolo

[Cc +Paolo]

On Fri, 4 Jun 2021 09:28:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Jun 04, 2021 at 08:38:26AM +0000, Tian, Kevin wrote:
> > > I think more to drive the replacement design; if we can't figure out
> > > how to do something other than backwards compatibility trickery in the
> > > kernel, it's probably going to bite us.  Thanks,  
> > 
> > I'm a bit lost on the desired flow in your minds. Here is one flow based
> > on my understanding of this discussion. Please comment whether it
> > matches your thinking:
> > 
> > 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
> > 
> > 1) Qemu binds dev1 to ioasid_fd;
> > 
> > 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
> >      CACHE info i.e. whether underlying IOMMU can enforce snoop;
> > 
> > 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
> >     needs to figure out whether dev1 wants to do no-snoop. This might
> >     be based a fixed vendor/class list or specified by user;
> > 
> > 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
> >      flag is specified to decide the page table format, which is supposed
> >      to match dev1;  
> 
> > 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this 
> >      point, specify snoop/no-snoop again. If not supported by related 
> >      iommu or different from what gpa_ioasid has, attach fails.  
> 
> Why do we need to specify it again?

My thought as well.

> If the IOASID was created with the "block no-snoop" flag then it is
> blocked in that IOASID, and that blocking sets the page table format.
> 
> The only question is if we can successfully attach a device to the
> page table, or not.
> 
> The KVM interface is a bit tricky because Alex said this is partially
> security, wbinvd is only enabled if someone has a FD to a device that
> can support no-snoop. 
> 
> Personally I think this got way too complicated, the KVM interface
> should simply be
> 
> ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
> ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)
> 
> and let qemu sort it out based on command flags, detection, whatever.
> 
> 'ioasidfd, device_label' is the security proof that Alex asked
> for. This needs to be some device in the ioasidfd that declares it is
> capabale of no-snoop. Eg vfio_pci would always declare it is capable
> of no-snoop.
> 
> No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the
> IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
> is just a kernel-harmless uerspace bug.
> 
> Then user space can decide which of the various axis's it wants to
> optimize for.

Let's make sure the KVM folks are part of this decision; a re-cap for
them, KVM currently automatically enables wbinvd emulation when
potentially non-coherent devices are present which is determined solely
based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
to essentially force no-snoop transactions from a device to be cache
coherent.  This synchronization is triggered via the kvm-vfio device,
where QEMU creates the device and adds/removes vfio group fd
descriptors as an additional layer to prevent the user from enabling
wbinvd emulation on a whim.

IIRC, this latter association was considered a security/DoS issue to
prevent a malicious guest/userspace from creating a disproportionate
system load.

Where would KVM stand on allowing more direct userspace control of
wbinvd behavior?  Would arbitrary control be acceptable or should we
continue to require it only in association to a device requiring it for
correct operation.

A wrinkle in "correct operation" is that while the IOMMU may be able to
force no-snoop transactions to be coherent, in the scenario described
in the previous reply, the user may intend to use non-coherent DMA
regardless of the IOMMU capabilities due to their own optimization
policy.  There's a whole spectrum here, including aspects we can't
determine around the device driver's intentions to use non-coherent
transactions, the user's policy in trading hypervisor overhead for
cache coherence overhead, etc.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  9:19                                   ` Tian, Kevin
@ 2021-06-04 15:37                                     ` Alex Williamson
  0 siblings, 0 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-04 15:37 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, 4 Jun 2021 09:19:50 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, June 4, 2021 4:42 AM
> >   
> > > 'qemu --allow-no-snoop' makes more sense to me  
> > 
> > I'd be tempted to attach it to the -device vfio-pci option, it's
> > specific drivers for specific devices that are going to want this and
> > those devices may not be permanently attached to the VM.  But I see in
> > the other thread you're trying to optimize IOMMU page table sharing.
> > 
> > There's a usability question in either case though and I'm not sure how
> > to get around it other than QEMU or the kernel knowing a list of
> > devices (explicit IDs or vendor+class) to select per device defaults.
> >   
> 
> "-device vfio-pci" is a per-device option, which implies that the
> no-snoop choice is given to the admin then no need to maintain 
> a fixed device list in Qemu?

I think we want to look at where we put it to have the best default
user experience.  For example the QEMU vfio-pci device option could use
on/off/auto semantics where auto is the default and QEMU maintains a
list of IDs or vendor/class configurations where we've determined the
"optimal" auto configuration.  Management tools could provide an
override, but we're imposing some pretty technical requirements for a
management tool to be able to come up with good per device defaults.
Seems like we should consolidate that technical decision in one place.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 15:26                                         ` Alex Williamson
@ 2021-06-04 15:40                                           ` Paolo Bonzini
  2021-06-04 15:50                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Paolo Bonzini @ 2021-06-04 15:40 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On 04/06/21 17:26, Alex Williamson wrote:
> Let's make sure the KVM folks are part of this decision; a re-cap for
> them, KVM currently automatically enables wbinvd emulation when
> potentially non-coherent devices are present which is determined solely
> based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
> to essentially force no-snoop transactions from a device to be cache
> coherent.  This synchronization is triggered via the kvm-vfio device,
> where QEMU creates the device and adds/removes vfio group fd
> descriptors as an additionally layer to prevent the user from enabling
> wbinvd emulation on a whim.
> 
> IIRC, this latter association was considered a security/DoS issue to
> prevent a malicious guest/userspace from creating a disproportionate
> system load.
> 
> Where would KVM stand on allowing more direct userspace control of
> wbinvd behavior?  Would arbitrary control be acceptable or should we
> continue to require it only in association to a device requiring it for
> correct operation.

Extending the scenarios where WBINVD is not a nop is not a problem for 
me.  If possible I wouldn't mind keeping the existing kvm-vfio 
connection via the device, if only because then the decision remains in 
the VFIO camp (whose judgment I trust more than mine on this kind of issue).

For example, would it make sense if *VFIO* (not KVM) gets an API that 
says "I am going to do incoherent DMA"?  Then that API causes WBINVD to 
become not-a-nop even on otherwise coherent platforms.  (Would this make 
sense at all without a hypervisor that indirectly lets userspace execute 
WBINVD?  Perhaps VFIO would benefit from a WBINVD ioctl too).

Paolo


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 15:40                                           ` Paolo Bonzini
@ 2021-06-04 15:50                                             ` Jason Gunthorpe
  2021-06-04 15:57                                               ` Paolo Bonzini
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 15:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 05:40:34PM +0200, Paolo Bonzini wrote:
> On 04/06/21 17:26, Alex Williamson wrote:
> > Let's make sure the KVM folks are part of this decision; a re-cap for
> > them, KVM currently automatically enables wbinvd emulation when
> > potentially non-coherent devices are present which is determined solely
> > based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
> > to essentially force no-snoop transactions from a device to be cache
> > coherent.  This synchronization is triggered via the kvm-vfio device,
> > where QEMU creates the device and adds/removes vfio group fd
> > descriptors as an additionally layer to prevent the user from enabling
> > wbinvd emulation on a whim.
> > 
> > IIRC, this latter association was considered a security/DoS issue to
> > prevent a malicious guest/userspace from creating a disproportionate
> > system load.
> > 
> > Where would KVM stand on allowing more direct userspace control of
> > wbinvd behavior?  Would arbitrary control be acceptable or should we
> > continue to require it only in association to a device requiring it for
> > correct operation.
> 
> Extending the scenarios where WBINVD is not a nop is not a problem for me.
> If possible I wouldn't mind keeping the existing kvm-vfio connection via the
> device, if only because then the decision remains in the VFIO camp (whose
> judgment I trust more than mine on this kind of issue).

Really the question to answer is what "security proof" do you want
before the wbinvd can be enabled

 1) User has access to a device that can issue no-snoop TLPs
 2) User has access to an IOMMU that can not block no-snoop (today)
 3) Require CAP_SYS_RAW_IO
 4) Anyone

#1 is an improvement because it allows userspace to enable wbinvd and
no-snoop optimizations based on user choice

#2 is where we are today and wbinvd effectively becomes a fixed
platform choice. Userspace has no say

#3 is "there is a problem, but not so serious, root is powerful
   enough to override"

#4 is "there is no problem here"

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 15:50                                             ` Jason Gunthorpe
@ 2021-06-04 15:57                                               ` Paolo Bonzini
  2021-06-04 16:03                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Paolo Bonzini @ 2021-06-04 15:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On 04/06/21 17:50, Jason Gunthorpe wrote:
>> Extending the scenarios where WBINVD is not a nop is not a problem for me.
>> If possible I wouldn't mind keeping the existing kvm-vfio connection via the
>> device, if only because then the decision remains in the VFIO camp (whose
>> judgment I trust more than mine on this kind of issue).
> Really the question to answer is what "security proof" do you want
> before the wbinvd can be enabled

I don't want a security proof myself; I want to trust VFIO to make the 
right judgment and I'm happy to defer to it (via the KVM-VFIO device).

Given how KVM is just a device driver inside Linux, VMs should be a 
slightly more roundabout way to do stuff that is accessible to bare 
metal; not a way to gain extra privilege.

Paolo

>   1) User has access to a device that can issue no-snoop TLPS
>   2) User has access to an IOMMU that can not block no-snoop (today)
>   3) Require CAP_SYS_RAW_IO
>   4) Anyone
> 
> #1 is an improvement because it allows userspace to enable wbinvd and
> no-snoop optimizations based on user choice
> 
> #2 is where we are today and wbinvd effectively becomes a fixed
> platform choice. Userspace has no say
> 
> #3 is "there is a problem, but not so serious, root is powerful
>     enough to override"


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 15:57                                               ` Paolo Bonzini
@ 2021-06-04 16:03                                                 ` Jason Gunthorpe
  2021-06-04 16:10                                                   ` Paolo Bonzini
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 16:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> On 04/06/21 17:50, Jason Gunthorpe wrote:
> > > Extending the scenarios where WBINVD is not a nop is not a problem for me.
> > > If possible I wouldn't mind keeping the existing kvm-vfio connection via the
> > > device, if only because then the decision remains in the VFIO camp (whose
> > > judgment I trust more than mine on this kind of issue).
> > Really the question to answer is what "security proof" do you want
> > before the wbinvd can be enabled
> 
> I don't want a security proof myself; I want to trust VFIO to make the right
> judgment and I'm happy to defer to it (via the KVM-VFIO device).
> 
> Given how KVM is just a device driver inside Linux, VMs should be a slightly
> more roundabout way to do stuff that is accessible to bare metal; not a way
> to gain extra privilege.

Okay, fine, lets turn the question on its head then.

VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that a userspace VFIO
application can make use of no-snoop optimizations. The ability of KVM
to execute wbinvd should be tied to the ability of that IOCTL to run
in a normal process context.

So, under what conditions do we want to allow VFIO to give a process
elevated access to the CPU:

> >   1) User has access to a device that can issue no-snoop TLPS
> >   2) User has access to an IOMMU that can not block no-snoop (today)
> >   3) Require CAP_SYS_RAW_IO
> >   4) Anyone

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 16:03                                                 ` Jason Gunthorpe
@ 2021-06-04 16:10                                                   ` Paolo Bonzini
  2021-06-04 17:22                                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Paolo Bonzini @ 2021-06-04 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On 04/06/21 18:03, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
>> I don't want a security proof myself; I want to trust VFIO to make the right
>> judgment and I'm happy to defer to it (via the KVM-VFIO device).
>>
>> Given how KVM is just a device driver inside Linux, VMs should be a slightly
>> more roundabout way to do stuff that is accessible to bare metal; not a way
>> to gain extra privilege.
> 
> Okay, fine, lets turn the question on its head then.
> 
> VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> application can make use of no-snoop optimizations. The ability of KVM
> to execute wbinvd should be tied to the ability of that IOCTL to run
> in a normal process context.
> 
> So, under what conditions do we want to allow VFIO to giave a process
> elevated access to the CPU:

Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e. 
#2+#3 would be worse than what we have today), but IIUC the proposal 
(was it yours or Kevin's?) was to keep #2 and add #1 with an 
enable/disable ioctl, which then would be on VFIO and not on KVM.  I 
assumed Alex was more or less okay with it, given he included me in the 
discussion.

If later y'all switch to "it's always okay to issue the enable/disable 
ioctl", I guess the rationale would be documented in the commit message.

Paolo

>>>    1) User has access to a device that can issue no-snoop TLPS
>>>    2) User has access to an IOMMU that can not block no-snoop (today)
>>>    3) Require CAP_SYS_RAW_IO
>>>    4) Anyone
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 16:22                 ` Jacob Pan
@ 2021-06-04 16:22                   ` Jason Gunthorpe
  2021-06-04 18:05                     ` Jacob Pan
  0 siblings, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 16:22 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Wang, Shenming Lu, Lu Baolu, Tian, Kevin, LKML,
	Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang

On Fri, Jun 04, 2021 at 09:22:43AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang <jasowang@redhat.com> wrote:
> 
> > 在 2021/6/4 上午2:19, Jacob Pan 写道:
> > > Hi Shenming,
> > >
> > > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <lushenming@huawei.com>
> > > wrote:
> > >  
> > >> On 2021/6/2 1:33, Jason Gunthorpe wrote:  
> > >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> > >>>      
> > >>>> The drivers register per page table fault handlers to /dev/ioasid
> > >>>> which will then register itself to iommu core to listen and route
> > >>>> the per- device I/O page faults.  
> > >>> I'm still confused why drivers need fault handlers at all?  
> > >> Essentially it is the userspace that needs the fault handlers,
> > >> one case is to deliver the faults to the vIOMMU, and another
> > >> case is to enable IOPF on the GPA address space for on-demand
> > >> paging, it seems that both could be specified in/through the
> > >> IOASID_ALLOC ioctl?
> > >>  
> > > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > > registered. There wouldn't be any IO page fault without the binding
> > > anyway.
> > >
> > > I also don't understand why device drivers should register the fault
> > > handler, the fault is detected by the pIOMMU and injected to the
> > > vIOMMU. So I think it should be the IOASID itself register the handler.
> > >  
> > 
> > 
> > As discussed in another thread.
> > 
> > I think the reason is that ATS doesn't forbid the #PF to be reported via 
> > a device specific way.
> 
> Yes, in that case we should support both. Give the device driver a chance
> to handle the IOPF if it can.

Huh?

The device driver does not "handle the IOPF" the device driver might
inject the IOPF.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  1:30               ` Jason Wang
@ 2021-06-04 16:22                 ` Jacob Pan
  2021-06-04 16:22                   ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Jacob Pan @ 2021-06-04 16:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Shenming Lu, Jason Gunthorpe, Lu Baolu, Tian, Kevin, LKML,
	Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang, jacob.jun.pan

Hi Jason,

On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang <jasowang@redhat.com> wrote:

> 在 2021/6/4 上午2:19, Jacob Pan 写道:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <lushenming@huawei.com>
> > wrote:
> >  
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:  
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>      
> >>>> The drivers register per page table fault handlers to /dev/ioasid
> >>>> which will then register itself to iommu core to listen and route
> >>>> the per- device I/O page faults.  
> >>> I'm still confused why drivers need fault handlers at all?  
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>  
> > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> > anyway.
> >
> > I also don't understand why device drivers should register the fault
> > handler, the fault is detected by the pIOMMU and injected to the
> > vIOMMU. So I think it should be the IOASID itself register the handler.
> >  
> 
> 
> As discussed in another thread.
> 
> I think the reason is that ATS doesn't forbid the #PF to be reported via 
> a device specific way.
> 
Yes, in that case we should support both. Give the device driver a chance
to handle the IOPF if it can.

> Thanks
> 
> 
> >  
> >> Thanks,
> >> Shenming
> >>  
> >
> > Thanks,
> >
> > Jacob
> >  
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 16:10                                                   ` Paolo Bonzini
@ 2021-06-04 17:22                                                     ` Jason Gunthorpe
  2021-06-04 21:29                                                       ` Alex Williamson
  2021-06-05  6:22                                                       ` Paolo Bonzini
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 17:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> On 04/06/21 18:03, Jason Gunthorpe wrote:
> > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > 
> > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > to gain extra privilege.
> > 
> > Okay, fine, lets turn the question on its head then.
> > 
> > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > application can make use of no-snoop optimizations. The ability of KVM
> > to execute wbinvd should be tied to the ability of that IOCTL to run
> > in a normal process context.
> > 
> > So, under what conditions do we want to allow VFIO to giave a process
> > elevated access to the CPU:
> 
> Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> which then would be on VFIO and not on KVM.  

At the end of the day we need an ioctl with two arguments:
 - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
 - The KVM FD to control wbinvd support on

Philosophically it doesn't matter too much which subsystem that ioctl
lives in, but we have these obnoxious cross-module dependencies to
consider.. 

Framing the question, as you have, to be about the process, I think
explains why KVM doesn't really care what is decided, so long as the
process and the VM have equivalent rights.

Alex, how about a more fleshed out suggestion:

 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
    it communicates its no-snoop configuration:
     - 0 enable, allow WBINVD
     - 1 automatic disable, block WBINVD if the platform
       IOMMU can police it (what we do today)
     - 2 force disable, do not allow WBINVD ever

    vfio_pci may want to take this from an admin configuration knob
    someplace. It allows the admin to customize if they want.

    If we can figure out a way to autodetect 2 from vfio_pci, all the
    better

 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
    to access wbinvd so it can make use of the no-snoop optimization.

    wbinvd is allowed when:
      - A device is joined with mode #0
      - A device is joined with mode #1 and the IOMMU cannot block
        no-snoop (today)

 3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
    is blocked and userspace doesn't request to block no-snoop in the
    IOASID then it is a userspace error.

 4) The KVM interface is the very simple enable/disable WBINVD.
    Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
    to enable WBINVD at KVM.

It is pretty simple from a /dev/ioasid perspective, covers today's
compat requirement, gives some future option to allow the no-snoop
optimization, and gives a new option for qemu to totally block wbinvd
no matter what.
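
Sketching how that would look from userspace, purely to check the fit
(all names below, the attach struct layout and the fds the ioctls run
on, are illustrative only):

    /* 1) attach with an explicit no-snoop mode (the 0/1/2 above) */
    struct vfio_attach_ioasid attach = {
            .argsz        = sizeof(attach),
            .ioasid       = gpa_ioasid,
            .no_snoop_mode = VFIO_NO_SNOOP_AUTO,
    };
    ioctl(device_fd, VFIO_ATTACH_IOASID, &attach);

    /* 2) userspace itself can flush when the join allows it */
    bool have_wbinvd = ioctl(ioasid_fd, IOMMU_EXECUTE_WBINVD) == 0;

    /* 4) KVM only needs proof that wbinvd is already reachable */
    if (have_wbinvd)
            ioctl(vm_fd, KVM_ENABLE_WBINVD, &ioasid_fd);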

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 12:05           ` Jason Gunthorpe
@ 2021-06-04 17:27             ` Jacob Pan
  2021-06-04 17:40               ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Jacob Pan @ 2021-06-04 17:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, David Gibson, Tian, Kevin, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Kirti Wankhede, Robin Murphy,
	jacob.jun.pan

Hi Jason,

On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> 
> > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > what hardware features are available before going further with the
> > vIOMMU setup (is there PASID, PRI, which page table formats are
> > supported,  
> 
> I think David's point was that qemu should be told what vIOMMU it is
> emulating exactly (right down to what features it has) and then
> the goal is simply to match what the vIOMMU needs with direct HW
> support via /dev/ioasid and fall back to SW emulation when not
> possible.
> 
> If qemu wants to have some auto-configuration: 'pass host IOMMU
> capabilities' similar to the CPU flags then qemu should probe the
> /dev/ioasid - and maybe we should just return some highly rolled up
> "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
> doing this.
> 
There can be mixed types of physical IOMMUs on the host. So until a
device is attached, we would not know whether the vIOMMU can match the HW
support of the device's IOMMU. Perhaps the vIOMMU should check the
least common denominator features before committing.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 17:27             ` Jacob Pan
@ 2021-06-04 17:40               ` Jason Gunthorpe
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 17:40 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, David Gibson, Tian, Kevin, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Kirti Wankhede, Robin Murphy

On Fri, Jun 04, 2021 at 10:27:43AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> > 
> > > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > > what hardware features are available before going further with the
> > > vIOMMU setup (is there PASID, PRI, which page table formats are
> > > supported,  
> > 
> > I think David's point was that qemu should be told what vIOMMU it is
> > emulating exactly (right down to what features it has) and then
> > the goal is simply to match what the vIOMMU needs with direct HW
> > support via /dev/ioasid and fall back to SW emulation when not
> > possible.
> > 
> > If qemu wants to have some auto-configuration: 'pass host IOMMU
> > capabilities' similar to the CPU flags then qemu should probe the
> > /dev/ioasid - and maybe we should just return some highly rolled up
> > "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
> > doing this.
> > 
> There can be mixed types of physical IOMMUs on the host. So not until a
> device is attached, we would not know if the vIOMMU can match the HW
> support of the device's IOMMU. Perhaps, vIOMMU should check the
> least common denominator features before commit.

qemu has to set the vIOMMU at VM startup time, so if it is running in
some "copy host" mode the only thing it can do is evaluate the VFIO
devices that are present at boot and select a vIOMMU from that list.

Probably it would pick the most capable physical IOMMU and software
emulate the rest.

platforms really should avoid creating wildly divergent IOMMUs in the
same system if they want to support virtualization effectively.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 16:22                   ` Jason Gunthorpe
@ 2021-06-04 18:05                     ` Jacob Pan
  0 siblings, 0 replies; 258+ messages in thread
From: Jacob Pan @ 2021-06-04 18:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, Shenming Lu, Lu Baolu, Tian, Kevin, LKML,
	Joerg Roedel, David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang, jacob.jun.pan

Hi Jason,

On Fri, 4 Jun 2021 13:22:00 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> > 
> > Yes, in that case we should support both. Give the device driver a
> > chance to handle the IOPF if it can.  
> 
> Huh?
> 
> The device driver does not "handle the IOPF" the device driver might
> inject the IOPF.
You are right, I got confused with the native case where device drivers can
handle the fault, or do something about it.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 17:22                                                     ` Jason Gunthorpe
@ 2021-06-04 21:29                                                       ` Alex Williamson
  2021-06-04 23:01                                                         ` Jason Gunthorpe
  2021-06-07  3:25                                                         ` Tian, Kevin
  2021-06-05  6:22                                                       ` Paolo Bonzini
  1 sibling, 2 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-04 21:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paolo Bonzini, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, 4 Jun 2021 14:22:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > 
> > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > to gain extra privilege.  
> > > 
> > > Okay, fine, lets turn the question on its head then.
> > > 
> > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > application can make use of no-snoop optimizations. The ability of KVM
> > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > in a normal process context.
> > > 
> > > So, under what conditions do we want to allow VFIO to giave a process
> > > elevated access to the CPU:  
> > 
> > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > which then would be on VFIO and not on KVM.    
> 
> At the end of the day we need an ioctl with two arguments:
>  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
>  - The KVM FD to control wbinvd support on
> 
> Philosophically it doesn't matter too much which subsystem that ioctl
> lives, but we have these obnoxious cross module dependencies to
> consider.. 
> 
> Framing the question, as you have, to be about the process, I think
> explains why KVM doesn't really care what is decided, so long as the
> process and the VM have equivalent rights.
> 
> Alex, how about a more fleshed out suggestion:
> 
>  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
>     it communicates its no-snoop configuration:

Communicates to whom?

>      - 0 enable, allow WBINVD
>      - 1 automatic disable, block WBINVD if the platform
>        IOMMU can police it (what we do today)
>      - 2 force disable, do not allow BINVD ever

The only thing we know about the device is whether or not Enable
No-snoop is hard wired to zero, ie. it either can't generate no-snoop
TLPs ("coherent-only") or it might ("assumed non-coherent").  If
we're putting the policy decision in the hands of userspace they should
have access to wbinvd if they own a device that is assumed
non-coherent AND it's attached to an IOMMU (page table) that is not
blocking no-snoop (a "non-coherent IOASID").

I think that means that the IOASID needs to be created (IOASID_ALLOC)
with a flag that specifies whether this address space is coherent
(IOASID_GET_INFO probably needs a flag/cap to expose if the system
supports this).  All mappings in this IOASID would use IOMMU_CACHE and
devices attached to it would be required to be backed by an IOMMU
capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise).  If only
these IOASIDs exist, access to wbinvd would not be provided.  (How does
a user provided page table work? - reserved bit set, user error?)

Conversely, a user could create a non-coherent IOASID and attach any
device to it, regardless of IOMMU backing capabilities.  Only if an
assumed non-coherent device is attached would the wbinvd be allowed.

I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
and the IOASID world needs to understand the device's ability to
generate non-coherent DMA.  This wbinvd ioctl would be a no-op (or
some known errno) unless a non-coherent IOASID exists with a potentially
non-coherent device attached.
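
To make that model concrete, here is a minimal, purely illustrative
sketch of what a coherency-aware IOASID uAPI could look like; every name
below (the alloc flag, the info capability, the wbinvd ioctl) is invented
for this discussion and not taken from the posted proposal:

#include <linux/types.h>

/*
 * Hypothetical additions:
 * - IOASID_ALLOC takes a flag requesting an enforced-coherent address
 *   space: all mappings use IOMMU_CACHE and attach fails unless the
 *   backing IOMMU has IOMMU_CAP_CACHE_COHERENCY.
 * - IOASID_GET_INFO reports whether that enforcement is available.
 * - A wbinvd ioctl lives on the IOASID FD and is a no-op (or -ENXIO)
 *   unless a non-coherent IOASID with a potentially non-coherent
 *   device attached exists.
 */
struct ioasid_alloc_req {
	__u32	argsz;
	__u32	flags;
#define IOASID_ALLOC_ENFORCE_COHERENT	(1 << 0)
	__u32	ioasid;		/* out */
	__u32	pad;
};

#define IOASID_INFO_CAP_ENFORCE_COHERENT	(1 << 0)	/* in GET_INFO */

/* ioctl(ioasid_fd, IOASID_EXECUTE_WBINVD) -- gated as described above */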
 
>     vfio_pci may want to take this from an admin configuration knob
>     someplace. It allows the admin to customize if they want.
> 
>     If we can figure out a way to autodetect 2 from vfio_pci, all the
>     better
> 
>  2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
>     to access wbinvd so it can make use of the no snoop optimization.
> 
>     wbinvd is allowed when:
>       - A device is joined with mode #0
>       - A device is joined with mode #1 and the IOMMU cannot block
>         no-snoop (today)
> 
>  3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
>     is blocked and userspace doesn't request to block no-snoop in the
>     IOASID then it is a userspace error.

In my model above, the IOASID is central to this.
 
>  4) The KVM interface is the very simple enable/disable WBINVD.
>     Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
>     to enable WBINVD at KVM.

Right, and in the new world order, vfio is only a device driver, the
IOASID manages the device's DMA.  wbinvd is only necessary relative to
non-coherent DMA, which seems like QEMU needs to bump KVM with an
ioasidfd.
 
> It is pretty simple from a /dev/ioasid perpsective, covers todays
> compat requirement, gives some future option to allow the no-snoop
> optimization, and gives a new option for qemu to totally block wbinvd
> no matter what.

What do you imagine is the use case for totally blocking wbinvd?  In
the model I describe, wbinvd would always be a no-op/known-errno when
the IOASIDs are all allocated as coherent or a non-coherent IOASID has
only coherent-only devices attached.  Does userspace need a way to
prevent itself from scenarios where wbinvd is not a no-op?

In general I'm having trouble wrapping my brain around the semantics of
the enable/automatic/force-disable wbinvd specific proposal, sorry.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 12:13                                   ` Jason Gunthorpe
@ 2021-06-04 21:45                                     ` Alex Williamson
  0 siblings, 0 replies; 258+ messages in thread
From: Alex Williamson @ 2021-06-04 21:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, 4 Jun 2021 09:13:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote:
> 
> > Could you clarify "vfio_driver"?    
> 
> This is the thing providing the vfio_device_ops function pointers.
> 
> So vfio-pci can't know anything about this (although your no-snoop
> control probing idea makes sense to me)
> 
> But vfio_mlx5_pci can know
> 
> So can mdev_idxd
> 
> And kvmgt

A capability on VFIO_DEVICE_GET_INFO could provide a hint to userspace.
Stock vfio-pci could fill it out to the extent of advertising whether the
device is capable of non-coherent DMA based on the Enable No-snoop
probing; the device-specific vfio_drivers could set it based on
knowledge of the device behavior.  Another bit might indicate a
preference to not suppress non-coherent DMA at the IOMMU.  Thanks,

Alex
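
(As a purely illustrative aside, such a hint could be chained through the
existing vfio_info_cap_header mechanism; the capability id and field
names below are made up for this discussion:)

#include <linux/types.h>
#include <linux/vfio.h>

/* Hypothetical capability returned by VFIO_DEVICE_GET_INFO. */
struct vfio_device_info_cap_nosnoop {
	struct vfio_info_cap_header header;	/* chained like other caps */
	__u32 flags;
#define VFIO_DEVICE_NOSNOOP_POSSIBLE	(1 << 0) /* Enable No-snoop not hardwired to 0 */
#define VFIO_DEVICE_NOSNOOP_PREFERRED	(1 << 1) /* prefer no-snoop not suppressed at the IOMMU */
	__u32 pad;
};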


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 21:29                                                       ` Alex Williamson
@ 2021-06-04 23:01                                                         ` Jason Gunthorpe
  2021-06-07 15:41                                                           ` Alex Williamson
  2021-06-07  3:25                                                         ` Tian, Kevin
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 23:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > 
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > to gain extra privilege.  
> > > > 
> > > > Okay, fine, lets turn the question on its head then.
> > > > 
> > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > > 
> > > > So, under what conditions do we want to allow VFIO to giave a process
> > > > elevated access to the CPU:  
> > > 
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.    
> > 
> > At the end of the day we need an ioctl with two arguments:
> >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> >  - The KVM FD to control wbinvd support on
> > 
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider.. 
> > 
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> > 
> > Alex, how about a more fleshed out suggestion:
> > 
> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> >     it communicates its no-snoop configuration:
> 
> Communicates to whom?

To the /dev/iommu FD which will have to maintain a list of devices
attached to it internally.

> >      - 0 enable, allow WBINVD
> >      - 1 automatic disable, block WBINVD if the platform
> >        IOMMU can police it (what we do today)
> >      - 2 force disable, do not allow BINVD ever
> 
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").  

Here I am outlining the choices, and also imagining we might want an
admin knob to select among the three.

> If we're putting the policy decision in the hands of userspace they
> should have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").

There are two parts here, as Paolo was leading to. If the process
has access to WBINVD and then if such an allowed process tells KVM to
turn on WBINVD in the guest.

If the process has a device and it has a way to create a non-coherent
IOASID, then that process has access to WBINVD.

For security it doesn't matter if the process actually creates the
non-coherent IOASID or not. An attacker will simply do the steps that
give access to WBINVD.

The important detail is that access to WBINVD does not compel the
process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
can still choose to create a secure guest by always using IOMMU_CACHE
in its page tables and not asking KVM to enable WBINVD.

This proposal shifts the policy decision from the kernel to userspace.
qemu is responsible for determining whether KVM should enable wbinvd,
based on whether it was able to create IOASIDs with IOMMU_CACHE.
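
A minimal sketch of that userspace policy, assuming hypothetical
ioasid_alloc()/vfio_attach_ioasid() wrappers and an invented
enforce-coherency flag (none of these names come from the proposal):

/* Returns true if QEMU should ask KVM to enable WBINVD emulation. */
static bool setup_dma_and_check_wbinvd(int ioasid_fd, int *dev_fds, int ndev)
{
	bool noncoherent = false;
	int ioasid, i;

	/* Try an enforced-coherent address space first. */
	ioasid = ioasid_alloc(ioasid_fd, IOASID_ALLOC_ENFORCE_COHERENT);
	if (ioasid < 0) {
		/* The IOMMU cannot police no-snoop: fall back and remember. */
		ioasid = ioasid_alloc(ioasid_fd, 0);
		noncoherent = true;
	}

	for (i = 0; i < ndev; i++)
		vfio_attach_ioasid(dev_fds[i], ioasid_fd, ioasid);

	/* Only a non-coherent IOASID makes guest WBINVD meaningful. */
	return noncoherent;
}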

> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities.  Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.

Right, this is exactly the point. Since the user gets to pick if the
IOASID is coherent or not then an attacker can always reach WBINVD
using only the device FD. Additional checks don't add to the security
of the process.

The additional checks you are describing add to the security of the
guest; however, qemu is capable of doing them without more help from
the kernel.

It is the strength of Paolo's model that KVM can optionally do less,
but never more, than the process itself can do.

> > It is pretty simple from a /dev/ioasid perpsective, covers todays
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no matter what.
> 
> What do you imagine is the use case for totally blocking wbinvd? 

If wbinvd is really important for security then an operator should
endeavor to turn it off. It can be safely turned off if the operator
understands the SRIOV devices they are using, ie if you are only using
mlx5 or an NVMe device then force it off and be secure, regardless of
the platform capability.

Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 12:09         ` Jason Gunthorpe
@ 2021-06-04 23:10           ` Tian, Kevin
  2021-06-07 17:54             ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04 23:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, June 4, 2021 8:09 PM
> 
> On Fri, Jun 04, 2021 at 06:37:26AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Thursday, June 3, 2021 9:05 PM
> > >
> > > > >
> > > > > 3) Device accepts any PASIDs from the guest. No
> > > > >    vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > 4) Device accepts any PASID from the guest and has an
> > > > >    internal vPASID/pPASID translation (enhanced vfio_pci)
> > > >
> > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > which doesn't support ENQCMD
> > >
> > > This is a vfio_pci that mediates some element of the device interface
> > > to communicate the vPASID/pPASID table to the device, using Max's
> > > series for vfio_pci drivers to inject itself into VFIO.
> > >
> > > For instance a device might send a message through the PF that the VF
> > > has a certain vPASID/pPASID translation table. This would be useful
> > > for devices that cannot use ENQCMD but still want to support migration
> > > and thus need vPASID.
> >
> > I still don't quite get. If it's a PCI device why is PASID translation required?
> > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > vPASID space is just straightforward.
> 
> This is only possible if we get rid of the global pPASID allocation
> (honestly is my preference as it makes the HW a lot simpler)
> 

In this proposal global vs. per-RID allocation is a per-device policy.
For vfio-pci it can always use per-RID (regardless of whether the
device is partially mediated or not) and no vPASID/pPASID conversion.
Even for mdev, if there is no ENQCMD we can still do per-RID conversion.
Only for mdev with ENQCMD do we need global pPASID allocation.

I think this is the motivation you explained earlier, that it's not good
to have one global PASID allocator in the kernel; per-RID vs. global
should be selected per device.
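
As a rough illustration only (the names are invented here, not part of
the proposal), the per-device policy could be recorded when the device
is bound, along these lines:

/* Sketch of a per-device PASID allocation policy. */
enum pasid_alloc_policy {
	PASID_ALLOC_PER_RID,	/* e.g. vfio-pci, or mdev without ENQCMD */
	PASID_ALLOC_GLOBAL,	/* e.g. mdev with ENQCMD: pPASID shared across RIDs */
};

struct ioasid_bound_device {
	struct device *dev;
	enum pasid_alloc_policy policy;
	/* Used only for PASID_ALLOC_PER_RID; PASID_ALLOC_GLOBAL would go
	 * through the system-wide ioasid allocator instead. */
	struct ida rid_local_pasids;
};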

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 12:33             ` Jason Gunthorpe
@ 2021-06-04 23:20               ` Tian, Kevin
  0 siblings, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-04 23:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Gibson, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, kvm, Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L,
	Wu, Hao, Jiang, Dave, Jacob Pan, Jean-Philippe Brucker,
	Kirti Wankhede, Robin Murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, June 4, 2021 8:34 PM
> 
> On Fri, Jun 04, 2021 at 06:08:28AM +0000, Tian, Kevin wrote:
> 
> > In Qemu case the problem is that it doesn't know the list of devices
> > that will be attached to an IOASID when it's created. This is a guest-
> > side knowledge which is conveyed one device at a time to Qemu
> > though vIOMMU.
> 
> At least for the guest side it is alot simpler because the vIOMMU
> being emulated will define nearly everything.
> 
> qemu will just have to ask the kernel for whatever it is the guest is
> doing. If the kernel can't do it then qemu has to SW emulate.
> 
> The no-snoop block may be the only thing that is under qemu's control
> because it is transparent to the guest.
> 
> This will probably become clearer as people start to define what the
> get_info should return.
> 

Sure. Just to clarify, my comment was about "Perhaps creating an
IOASID should pass in a list of the device labels that the IOASID will
be used with". My point is that Qemu doesn't know this fact before
the guest completes binding the page table to all relevant devices, while
the IOASID must be created when the table is bound to the first device. So
Qemu just needs to create the IOASID with the format required for the
current device; incompatibility will be detected when attaching other
devices later.
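
A rough sketch of that flow on the Qemu side, with all helper names
invented for illustration:

/* Return the IOASID backing this guest page table, creating it on the
 * first device bind using that device's required format. */
static int ioasid_for_pgtable(int ioasid_fd, struct guest_pgtable *pgt,
			      int device_fd)
{
	if (pgt->ioasid < 0)
		pgt->ioasid = ioasid_alloc_with_format(ioasid_fd,
					query_pgtbl_format(device_fd));

	/* A later device with an incompatible format simply fails here,
	 * and Qemu reports the error back to the guest. */
	if (vfio_attach_ioasid(device_fd, ioasid_fd, pgt->ioasid) < 0)
		return -1;

	return pgt->ioasid;
}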

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 17:22                                                     ` Jason Gunthorpe
  2021-06-04 21:29                                                       ` Alex Williamson
@ 2021-06-05  6:22                                                       ` Paolo Bonzini
  2021-06-07  3:50                                                         ` Tian, Kevin
  2021-06-07 17:59                                                         ` Jason Gunthorpe
  1 sibling, 2 replies; 258+ messages in thread
From: Paolo Bonzini @ 2021-06-05  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On 04/06/21 19:22, Jason Gunthorpe wrote:
>   4) The KVM interface is the very simple enable/disable WBINVD.
>      Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
>      to enable WBINVD at KVM.

The KVM interface is the same kvm-vfio device that exists already.  The 
userspace API does not need to change at all: adding one VFIO file 
descriptor with WBINVD enabled to the kvm-vfio device lets the VM use 
WBINVD functionality (see kvm_vfio_update_coherency).

Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. 
But it seems useless complication compared to just using what we have 
now, at least while VMs only use IOASIDs via VFIO.
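
For reference, the sequence Paolo refers to is the existing kvm-vfio
device interface; a rough userspace sketch (error handling omitted)
looks like this:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd comes from KVM_CREATE_VM; group_fd is an open /dev/vfio/<group>. */
static void kvm_vfio_add_group(int vm_fd, int group_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_ADD,
		.addr  = (uintptr_t)&group_fd,
	};

	ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);	/* cd.fd: the kvm-vfio device */
	ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
	/* kvm_vfio_update_coherency() then recomputes whether the VM may
	 * need WBINVD, based on the coherency of the attached groups. */
}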

Either way, there should be no policy attached to the add/delete 
operations.  KVM users want to add the VFIO (or IOASID) file descriptors 
to the device independent of WBINVD.  If userspace wants/needs to apply 
its own policy on whether to enable WBINVD or not, they can do it on the 
VFIO/IOASID side:

>  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
>     it communicates its no-snoop configuration:
>      - 0 enable, allow WBINVD
>      - 1 automatic disable, block WBINVD if the platform
>        IOMMU can police it (what we do today)
>      - 2 force disable, do not allow BINVD ever

Though, like Alex, it's also not clear to me whether force-disable is 
useful.  Instead userspace can query the IOMMU or the device to ensure 
it's not enabled.

Paolo


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 11:58                               ` Jason Gunthorpe
@ 2021-06-07  3:18                                 ` Jason Wang
  2021-06-07 14:14                                   ` Jason Gunthorpe
  2021-06-30  7:05                                 ` Christoph Hellwig
  1 sibling, 1 reply; 258+ messages in thread
From: Jason Wang @ 2021-06-07  3:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse


On 2021/6/4 7:58 PM, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
>>> nor do any virtio drivers implement the required platform specific
>>> cache flushing to make no-snoop TLPs work.
>> I don't get why virtio drivers needs to do that. I think DMA API should hide
>> those arch/platform specific stuffs from us.
> It is not arch/platform stuff. If the device uses no-snoop then a
> very platform specific recovery is required in the device driver.
>
> It is not part of the normal DMA API, it is side APIs like
> flush_agp_cache() or wbinvd() that are used by GPU drivers only.


Yes, and virtio doesn't support AGP.


>
> If drivers/virtio doesn't explicitly call these things it doesn't
> support no-snoop - hence no VDPA device can ever use no-snoop.


Note that the fact that no drivers call these things doesn't mean it is
not supported by the spec.

Actually, the spec doesn't forbid non-coherent DMA; anyway, we can raise
a new thread on the virtio mailing list to discuss that.

But considering that virtio already supports GPU, crypto and sound
devices, and devices like codec and video are being proposed, it doesn't
help if we mandate coherent DMA now.

Thanks


>
> Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
> nothing to do with no-snoop.
>
> Jason
>


^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 21:29                                                       ` Alex Williamson
  2021-06-04 23:01                                                         ` Jason Gunthorpe
@ 2021-06-07  3:25                                                         ` Tian, Kevin
  2021-06-07  6:51                                                           ` Paolo Bonzini
  2021-06-30  6:56                                                           ` Christoph Hellwig
  1 sibling, 2 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-07  3:25 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Paolo Bonzini, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, June 5, 2021 5:29 AM
> 
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > I don't want a security proof myself; I want to trust VFIO to make the
> right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > >
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a
> slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not
> a way
> > > > > to gain extra privilege.
> > > >
> > > > Okay, fine, lets turn the question on its head then.
> > > >
> > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace
> VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > >
> > > > So, under what conditions do we want to allow VFIO to giave a process
> > > > elevated access to the CPU:
> > >
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal
> (was it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.
> >
> > At the end of the day we need an ioctl with two arguments:
> >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> >  - The KVM FD to control wbinvd support on
> >
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider..
> >
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> >
> > Alex, how about a more fleshed out suggestion:

Possibly just a naming thing, but I feel it's better to just talk about
no-snoop or non-coherent in the uAPI. Per the Intel SDM wbinvd is a
privileged instruction; a process on the host has no privilege to
execute it. Only when this process holds a VM does this instruction
matter, as there are guest privilege levels. But having a VFIO uAPI
(which is userspace oriented) explicitly deal with a CPU instruction
which makes sense only in a virtualization context sounds a bit weird...

> >
> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> >     it communicates its no-snoop configuration:
> 
> Communicates to whom?
> 
> >      - 0 enable, allow WBINVD
> >      - 1 automatic disable, block WBINVD if the platform
> >        IOMMU can police it (what we do today)
> >      - 2 force disable, do not allow BINVD ever
> 
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").  If
> we're putting the policy decision in the hands of userspace they should
> have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").
> 
> I think that means that the IOASID needs to be created (IOASID_ALLOC)
> with a flag that specifies whether this address space is coherent
> (IOASID_GET_INFO probably needs a flag/cap to expose if the system
> supports this).  All mappings in this IOASID would use IOMMU_CACHE and

Yes, this sounds like a cleaner way than specifying this attribute late
in VFIO_ATTACH_IOASID. Following Jason's proposal, v2 will move to
the scheme requiring the user to specify format info when creating an
IOASID. Leaving coherency out of that box just adds some trickiness,
e.g. whether to allow the user to update the page table between ALLOC
and ATTACH.

> and devices attached to it would be required to be backed by an IOMMU
> capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise).  If only
> these IOASIDs exist, access to wbinvd would not be provided.  (How does
> a user provided page table work? - reserved bit set, user error?)
> 
> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities.  Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.
> 
> I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
> and the IOASID world needs to understand the device's ability to
> generate non-coherent DMA.  This wbinvd ioctl would be a no-op (or
> some known errno) unless a non-coherent IOASID exists with a potentially
> non-coherent device attached.
> 
> >     vfio_pci may want to take this from an admin configuration knob
> >     someplace. It allows the admin to customize if they want.
> >
> >     If we can figure out a way to autodetect 2 from vfio_pci, all the
> >     better
> >
> >  2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
> >     to access wbinvd so it can make use of the no snoop optimization.
> >
> >     wbinvd is allowed when:
> >       - A device is joined with mode #0
> >       - A device is joined with mode #1 and the IOMMU cannot block
> >         no-snoop (today)
> >
> >  3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
> >     is blocked and userspace doesn't request to block no-snoop in the
> >     IOASID then it is a userspace error.
> 
> In my model above, the IOASID is central to this.
> 
> >  4) The KVM interface is the very simple enable/disable WBINVD.
> >     Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> >     to enable WBINVD at KVM.
> 
> Right, and in the new world order, vfio is only a device driver, the
> IOASID manages the device's DMA.  wbinvd is only necessary relative to
> non-coherent DMA, which seems like QEMU needs to bump KVM with an
> ioasidfd.
> 
> > It is pretty simple from a /dev/ioasid perpsective, covers todays
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no matter what.
> 
> What do you imagine is the use case for totally blocking wbinvd?  In
> the model I describe, wbinvd would always be a no-op/known-errno when
> the IOASIDs are all allocated as coherent or a non-coherent IOASID has
> only coherent-only devices attached.  Does userspace need a way to
> prevent itself from scenarios where wbvind is not a no-op?
> 
> In general I'm having trouble wrapping my brain around the semantics of
> the enable/automatic/force-disable wbinvd specific proposal, sorry.
> Thanks,
> 

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-05  6:22                                                       ` Paolo Bonzini
@ 2021-06-07  3:50                                                         ` Tian, Kevin
  2021-06-07 17:59                                                         ` Jason Gunthorpe
  1 sibling, 0 replies; 258+ messages in thread
From: Tian, Kevin @ 2021-06-07  3:50 UTC (permalink / raw)
  To: Paolo Bonzini, Jason Gunthorpe
  Cc: Alex Williamson, Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok,
	kvm, Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

> From: Paolo Bonzini <pbonzini@redhat.com>
> Sent: Saturday, June 5, 2021 2:22 PM
> 
> On 04/06/21 19:22, Jason Gunthorpe wrote:
> >   4) The KVM interface is the very simple enable/disable WBINVD.
> >      Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> >      to enable WBINVD at KVM.
> 
> The KVM interface is the same kvm-vfio device that exists already.  The
> userspace API does not need to change at all: adding one VFIO file
> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use
> WBINVD functionality (see kvm_vfio_update_coherency).
> 
> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls.
> But it seems useless complication compared to just using what we have
> now, at least while VMs only use IOASIDs via VFIO.
> 

A new IOASID variation may make more sense in case non-vfio subsystems
want to handle a similar coherency problem. Per other discussions it
looks like it's still open whether vDPA wants it or not, and there could
be other passthrough frameworks in the future. Having them all use
vfio naming doesn't sound very clean. Anyway, the coherency attribute
must be configured on the IOASID in the end, so it looks reasonable for
KVM to learn the info from a unified place.

Just FYI, we are also planning a new IOASID-specific ioctl in KVM for
other usages. Future Intel platforms support a new ENQCMD instruction
for scalable work submission to the device. This instruction includes a
64-byte payload plus a PASID retrieved from a CPU MSR register (covered
by xsave). When supporting this instruction in the guest, the value in
the MSR is a guest PASID which must be translated to a host PASID.
A new VMCS structure (PASID translation table) is introduced for this
purpose. In this /dev/ioasid proposal, we propose VFIO_{UN}MAP_IOASID
for the user to update the VMCS structure properly. The user is
expected to provide {ioasid_fd, ioasid, vPASID} to KVM, which then
calls an ioasid helper function to figure out the corresponding hPASID
and update the specified entry.
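
A very rough sketch of what such an interface could carry; the struct
layout, ioctl names and numbers are all invented here, since the actual
KVM uAPI is still being designed:

#include <linux/types.h>
#include <linux/kvm.h>		/* KVMIO */

/* Hypothetical: install/remove one vPASID->hPASID entry in the per-VM
 * PASID translation table. KVM would call an ioasid-layer helper with
 * {ioasid_fd, ioasid, vpasid} to look up the host PASID. */
struct kvm_pasid_mapping {
	__s32	ioasid_fd;
	__u32	ioasid;
	__u32	vpasid;		/* guest value written to the PASID MSR */
	__u32	flags;
};

#define KVM_MAP_PASID	_IOW(KVMIO, 0xf0, struct kvm_pasid_mapping)	/* number invented */
#define KVM_UNMAP_PASID	_IOW(KVMIO, 0xf1, struct kvm_pasid_mapping)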

Thanks
Kevin

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-07  3:25                                                         ` Tian, Kevin
@ 2021-06-07  6:51                                                           ` Paolo Bonzini
  2021-06-07 18:01                                                             ` Jason Gunthorpe
  2021-06-30  6:56                                                           ` Christoph Hellwig
  1 sibling, 1 reply; 258+ messages in thread
From: Paolo Bonzini @ 2021-06-07  6:51 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Jiang, Dave, Raj, Ashok, kvm,
	Jonathan Corbet, Robin Murphy, LKML, iommu, David Gibson,
	Kirti Wankhede, David Woodhouse, Jason Wang

On 07/06/21 05:25, Tian, Kevin wrote:
> Per Intel SDM wbinvd is a privileged instruction. A process on the
> host has no privilege to execute it.

(Half of) the point of the kernel is to do privileged tasks on the 
processes' behalf.  There are good reasons why a process that uses VFIO 
(without KVM) could want to use wbinvd, so VFIO lets them do it with an
ioctl and adequate checks around the operation.

Paolo


^ permalink raw reply	[flat|nested] 258+ messages in thread

* RE: [RFC] /dev/ioasid uAPI proposal
  2021-06-04  2:03               ` Shenming Lu
@ 2021-06-07 12:19                 ` Liu, Yi L
  2021-06-08  1:09                   ` Shenming Lu
  0 siblings, 1 reply; 258+ messages in thread
From: Liu, Yi L @ 2021-06-07 12:19 UTC (permalink / raw)
  To: Shenming Lu, Jacob Pan
  Cc: Jason Gunthorpe, Lu Baolu, Tian, Kevin, LKML, Joerg Roedel,
	David Woodhouse, iommu, kvm,
	Alex Williamson (alex.williamson@redhat.com),
	Jason Wang, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, Jean-Philippe Brucker, David Gibson, Kirti Wankhede,
	Robin Murphy, Zenghui Yu, wanghaibin.wang

> From: Shenming Lu <lushenming@huawei.com>
> Sent: Friday, June 4, 2021 10:03 AM
> 
> On 2021/6/4 2:19, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
> <lushenming@huawei.com>
> > wrote:
> >
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>
> >>>> The drivers register per page table fault handlers to /dev/ioasid which
> >>>> will then register itself to iommu core to listen and route the per-
> >>>> device I/O page faults.
> >>>
> >>> I'm still confused why drivers need fault handlers at all?
> >>
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>
> > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> anyway.
> 
> Yeah, I also proposed this before, registering the handler in the
> BIND_PGTABLE
> ioctl does make sense for the guest page faults. :-)
> 
> But how about the page faults from the GPA address space (it's page table is
> mapped through the MAP_DMA ioctl)? From your point of view, it seems
> that we should register the handler for the GPA address space in the (first)
> MAP_DMA ioctl.

Under the new proposal, I think the page fault handler is also registered
per ioasid object. The difference compared with the guest page table case
is that there is no need to inject the fault into the VM.
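
Purely as an illustration of "registered per ioasid object", the
per-IOASID fault record that userspace consumes might look something
like this (hypothetical layout, not from the proposal):

#include <linux/types.h>

/* Hypothetical record delivered via the IOASID FD when an I/O page fault
 * needs user handling: for the GPA address space the user resolves it
 * directly (on-demand paging); for a guest page table it forwards the
 * fault to the vIOMMU. */
struct ioasid_fault_event {
	__u32	ioasid;
	__u32	flags;
#define IOASID_FAULT_WRITE		(1 << 0)
#define IOASID_FAULT_EXEC		(1 << 1)
#define IOASID_FAULT_PASID_VALID	(1 << 2)
	__u32	pasid;
	__u32	pad;
	__u64	addr;		/* faulting I/O virtual address */
};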
 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-02 17:21                           ` Jason Gunthorpe
@ 2021-06-07 13:30                             ` Enrico Weigelt, metux IT consult
  2021-06-07 18:01                               ` Jason Gunthorpe
  2021-06-08  1:10                             ` Jason Wang
  1 sibling, 1 reply; 258+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-07 13:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Jason Wang
  Cc: Tian, Kevin, Lu Baolu, Liu Yi L, kvm, Jonathan Corbet, iommu,
	LKML, Alex Williamson (alex.williamson@redhat.com)"",
	David Woodhouse

On 02.06.21 19:21, Jason Gunthorpe wrote:

Hi,

> Not really, once one thing in an applicate uses a large number FDs the
> entire application is effected. If any open() can return 'very big
> number' then nothing in the process is allowed to ever use select.

Isn't that a bug in select()?

--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send me your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-07  3:18                                 ` Jason Wang
@ 2021-06-07 14:14                                   ` Jason Gunthorpe
  2021-06-08  1:00                                     ` Jason Wang
  2021-06-30  7:07                                     ` Christoph Hellwig
  0 siblings, 2 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-07 14:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Alex Williamson, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse

On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:

> Note that no drivers call these things doesn't meant it was not
> supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-snoop.

no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NIC's expect
packets to be processed by the CPU.

"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low
cost ways to regain coherence. This is not at all what we are talking
about here at all.
 
Jason

^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 23:01                                                         ` Jason Gunthorpe
@ 2021-06-07 15:41                                                           ` Alex Williamson
  2021-06-07 18:18                                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 258+ messages in thread
From: Alex Williamson @ 2021-06-07 15:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paolo Bonzini, Tian, Kevin, Jean-Philippe Brucker, Jiang, Dave,
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML, iommu,
	David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang

On Fri, 4 Jun 2021 20:01:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> > On Fri, 4 Jun 2021 14:22:07 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:  
> > > > On 04/06/21 18:03, Jason Gunthorpe wrote:    
> > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:    
> > > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > > 
> > > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > > to gain extra privilege.    
> > > > > 
> > > > > Okay, fine, lets turn the question on its head then.
> > > > > 
> > > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > > in a normal process context.
> > > > > 
> > > > > So, under what conditions do we want to allow VFIO to giave a process
> > > > > elevated access to the CPU:    
> > > > 
> > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > > which then would be on VFIO and not on KVM.      
> > > 
> > > At the end of the day we need an ioctl with two arguments:
> > >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > >  - The KVM FD to control wbinvd support on
> > > 
> > > Philosophically it doesn't matter too much which subsystem that ioctl
> > > lives, but we have these obnoxious cross module dependencies to
> > > consider.. 
> > > 
> > > Framing the question, as you have, to be about the process, I think
> > > explains why KVM doesn't really care what is decided, so long as the
> > > process and the VM have equivalent rights.
> > > 
> > > Alex, how about a more fleshed out suggestion:
> > > 
> > >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > >     it communicates its no-snoop configuration:  
> > 
> > Communicates to whom?  
> 
> To the /dev/iommu FD which will have to maintain a list of devices
> attached to it internally.
> 
> > >      - 0 enable, allow WBINVD
> > >      - 1 automatic disable, block WBINVD if the platform
> > >        IOMMU can police it (what we do today)
> > >      - 2 force disable, do not allow BINVD ever  
> > 
> > The only thing we know about the device is whether or not Enable
> > No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> > TLPs ("coherent-only") or it might ("assumed non-coherent").    
> 
> Here I am outlining the choice an also imagining we might want an
> admin knob to select the three.

You're calling this an admin knob, which to me suggests a global module
option, so are you trying to implement both an administrator and a user
policy?  ie. the user can create scenarios where access to wbinvd might
be justified by hardware/IOMMU configuration, but can be limited by the
admin?

For example I proposed that the ioasidfd would bear the responsibility
of a wbinvd ioctl and therefore validate the user's access to enable
wbinvd emulation w/ KVM, so I'm assuming this module option lives
there.  I essentially described the "enable" behavior in my previous
reply, user has access to wbinvd if owning a non-coherent capable
device managed in a non-coherent IOASID.  Yes, the user IOASID
configuration controls the latter half of this.

What then is "automatic" mode?  The user cannot create a non-coherent
IOASID with a non-coherent device if the IOMMU supports no-snoop
blocking?  Do they get a failure?  Does it get silently promoted to
coherent?

In "disable" mode, I think we're just narrowing the restriction
further, a non-coherent capable device cannot be used except in a
forced coherent IOASID.

> > If we're putting the policy decision in the hands of userspace they
> > should have access to wbinvd if they own a device that is assumed
> > non-coherent AND it's attached to an IOMMU (page table) that is not
> > blocking no-snoop (a "non-coherent IOASID").  
> 
> There are two parts here, like Paolo was leading too. If the process
> has access to WBINVD and then if such an allowed process tells KVM to
> turn on WBINVD in the guest.
> 
> If the process has a device and it has a way to create a non-coherent
> IOASID, then that process has access to WBINVD.
> 
> For security it doesn't matter if the process actually creates the
> non-coherent IOASID or not. An attacker will simply do the steps that
> give access to WBINVD.

Yes, at this point the user has the ability to create a configuration
where they could have access to wbinvd, but if they haven't created
such a configuration, is the wbinvd a no-op?

> The important detail is that access to WBINVD does not compell the
> process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
> can still choose to create a secure guest by always using IOMMU_CACHE
> in its page tables and not asking KVM to enable WBINVD.

Of course.

> This propsal shifts this policy decision from the kernel to userspace.
> qemu is responsible to determine if KVM should enable wbinvd or not
> based on if it was able to create IOASID's with IOMMU_CACHE.

QEMU is responsible for making sure the VM is consistent; if
non-coherent DMA can occur, wbinvd is emulated.  But it's still the
KVM/IOASID connection that validates that access.

> > Conversely, a user could create a non-coherent IOASID and attach any
> > device to it, regardless of IOMMU backing capabilities.  Only if an
> > assumed non-coherent device is attached would the wbinvd be allowed.  
> 
> Right, this is exactly the point. Since the user gets to pick if the
> IOASID is coherent or not then an attacker can always reach WBINVD
> using only the device FD. Additional checks don't add to the security
> of the process.
> 
> The additional checks you are describing add to the security of the
> guest, however qemu is capable of doing them without more help from the
> kernel.
> 
> It is the strenth of Paolo's model that KVM should not be able to do
> optionally less, not more than the process itself can do.

I think my previous reply was working towards those guidelines.  I feel
like we're mostly in agreement, but perhaps reading past each other.
Nothing here convinced me against my previous proposal that the
ioasidfd bears responsibility for managing access to a wbinvd ioctl,
and therefore the equivalent KVM access.  Whether wbinvd is allowed or
no-op'd when the user has access to a non-coherent device in a
configuration where the IOMMU prevents non-coherent DMA is maybe still
a matter of personal preference.

> > > It is pretty simple from a /dev/ioasid perpsective, covers todays
> > > compat requirement, gives some future option to allow the no-snoop
> > > optimization, and gives a new option for qemu to totally block wbinvd
> > > no matter what.  
> > 
> > What do you imagine is the use case for totally blocking wbinvd?   
> 
> If wbinvd is really security important then an operator should endevor
> to turn it off. It can be safely turned off if the operator
> understands the SRIOV devices they are using. ie if you are only using
> mlx5 or a nvme then force it off and be secure, regardless of the
> platform capability.

Ok, I'm not opposed to something like a module option that restricts to
only coherent DMA, but we need to work through how that's exposed and
the userspace behavior.  The most obvious would be that a GET_INFO
ioctl on the ioasidfd indicates the restrictions, a flag on the IOASID
alloc indicates the coherency of the IOASID, and we fail any cases
where the admin policy or hardware support doesn't match (ie. alloc if
it's incompatible with policy, attach if the device/IOMMU backing
violates policy).  This is all a compatible layer with what I described
previously.  Thanks,

Alex
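
(To illustrate the compatible layering described above, with invented
names throughout: the admin policy surfaces through GET_INFO, the user
requests coherency at alloc time, and incompatible combinations fail:)

/* Hypothetical userspace probe; struct ioasid_info and the flag are
 * stand-ins, not part of the posted proposal. */
static bool noncoherent_dma_possible(int ioasid_fd)
{
	struct ioasid_info info = { .argsz = sizeof(info) };

	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	/* If admin policy or hardware restricts this fd to coherent DMA,
	 * a non-coherent IOASID_ALLOC (or an attach violating the policy)
	 * simply fails, and wbinvd stays a no-op. */
	return info.flags & IOASID_INFO_NONCOHERENT_ALLOWED;
}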


^ permalink raw reply	[flat|nested] 258+ messages in thread

* Re: [RFC] /dev/ioasid uAPI proposal
  2021-06-04 23:10           ` Tian, Kevin
@ 2021-06-07 17:54             ` Jason Gunthorpe
  0 siblings, 0 replies; 258+ messages in thread
From: Jason Gunthorpe @ 2021-06-07 17:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker,
	Alex Williamson (alex.williamson@redhat.com),
	Raj, Ashok, kvm, Jonathan Corbet, Robin Murphy, LKML,
	Kirti Wankhede, iommu, David Gibson, Jiang, Dave,
	David Woodhouse, Jason Wang

On Fri, Jun 04, 2021 at 11:10:53PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, June 4, 2021 8:09 PM
> > 
> > On Fri, Jun 04, 2021 at 06:37:26AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > >
> > > > > >
> > > > > > 3) Device accepts any PASIDs from the guest. No
> > > > > >    vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > > 4) Device accepts any PASID from the guest and has an
> > > > > >    internal vPASID/pPASID translation (enhanced vfio_pci)
> > > > >
> > > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > > which doesn't support ENQCMD
> > > >
> > > > This is a vfio_pci that mediates some element of the device interface
> > > > to communicate the vPASID/pPASID table to the device, using Max's
> > > > series for vfio_pci drivers to inject itself into VFIO.
> > > >
> > > > For instance a device might send a message through the PF that the VF
> > > > has a certain vPASID/pPASID translation table. This would be useful
> > > > for devices that cannot use ENQCMD but still want to support migration
> > > > and thus need vPASID.
> > >
> > > I still don't quite get. If it's a PCI device why is PASID translation required?
> > > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > > vPASID space is just straightforward.
> > 
> > This is only possible if we get rid of the global pPASID allocation
> > (honestly is my preference as it makes the HW a lot simpler)
> > 
> 
> In this proposal global vs. per-RID allocation is a per-device policy.
> for vfio-pci it can always use per-RID (regardless of whether the
> device is partially mediated or not) and no vPASID/pPASID conversion. 
> Even for mdev if no ENQCMD we can still do per-RID conversion.
> only for mdev which has ENQCMD we need global pPASID allocation.
> 
> I think this is the motivation you explained earlier that it's not good
> to have one global PASID allocator in the kernel. per-RID vs. global
> should be selected per device.

I thought we concluded this wasn't possible because the guest could
choose to bind the same vPASID to a RID and to an ENQCMD device and
then we run into trouble? Or are you saying that a RID device gets a
completely dedicated table and can always have vPASID == pPASID?

In any event it needs clear explanation in the next RFC

Jason