* [PATCH V4 00/18] IOASID extensions for guest SVA
@ 2021-02-27 22:01 Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 01/18] docs: Document IO Address Space ID (IOASID) APIs Jacob Pan
                   ` (18 more replies)
  0 siblings, 19 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

I/O Address Space ID (IOASID) core code was introduced in v5.5 as a generic
kernel allocator service for both PCIe Process Address Space ID (PASID) and
ARM SMMU's Substream ID. IOASIDs are used to associate DMA requests with
virtual address spaces, including both host and guest.

In addition to providing basic ID allocation, ioasid_set was defined as a
token that is shared by a group of IOASIDs. This set token can be used
for permission checking, but lacks some features needed by guest Shared
Virtual Addressing (SVA):
- Manage IOASIDs by group, group ownership, quota, etc.
- State synchronization among IOASID users (e.g. IOMMU driver, KVM, device
drivers)
- Non-identity guest-host IOASID mapping
- Lifecycle management

This patchset introduces the following extensions as solutions to the
problems above.
- Redefine and extend IOASID set such that IOASIDs can be managed by groups/pools.
- Add notifications for IOASID state synchronization
- Extend reference counting for life cycle alignment among multiple users
- Support ioasid_set private IDs, which can be used as guest IOASIDs
- Add a new cgroup controller for resource distribution

Please refer to Documentation/admin-guide/cgroup-v1/ioasids.rst and
Documentation/driver-api/ioasid.rst in the enclosed patches for more
details.
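
To make the set-private ID idea concrete, here is a small user-space model
of the guest-to-host mapping. It is only a sketch under my own assumptions:
the model_* names, the fixed-size table and the -1 error value are
hypothetical and are not the kernel API added by this series.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_ID   ((unsigned int)-1)
#define SET_CAPACITY 8	/* arbitrary size for this sketch */

/* One guest-to-host pair stored in a set. */
struct spid_entry {
	unsigned int spid;	/* guest-visible ID, private to the set */
	unsigned int host;	/* system-wide host ID */
	int used;
};

/* Hypothetical stand-in for an IOASID set, e.g. one per VM. */
struct model_set {
	struct spid_entry map[SET_CAPACITY];
};

/*
 * Record spid -> host in the set. Returns 0 on success, or -1 if the
 * SPID is already mapped in this set or the set is full.
 */
static int model_attach_spid(struct model_set *set, unsigned int host,
			     unsigned int spid)
{
	struct spid_entry *slot = NULL;
	size_t i;

	assert(set != NULL);
	for (i = 0; i < SET_CAPACITY; i++) {
		if (set->map[i].used && set->map[i].spid == spid)
			return -1;
		if (!set->map[i].used && !slot)
			slot = &set->map[i];
	}
	if (!slot)
		return -1;
	slot->spid = spid;
	slot->host = host;
	slot->used = 1;
	return 0;
}

/* Look up the host ID backing a guest SPID, searching one set only. */
static unsigned int model_find_by_spid(const struct model_set *set,
				       unsigned int spid)
{
	size_t i;

	assert(set != NULL);
	for (i = 0; i < SET_CAPACITY; i++)
		if (set->map[i].used && set->map[i].spid == spid)
			return set->map[i].host;
	return INVALID_ID;
}
```

Because the lookup is scoped to one set, two VMs can both use guest ID 101
while being backed by different host IDs, which is the non-identity mapping
listed above.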

Based on discussions on LKML[1], a direction change was made in v4 such that
the user interfaces for IOASID allocation are extracted from VFIO
subsystem. The proposed IOASID subsystem now consists of three components:
1. IOASID core[01-14]: provides APIs for allocation, pool management,
  notifications, and refcounting.
2. IOASID cgroup controller[RFC 15-17]: manages resource distribution[2].
3. IOASID user[RFC 18]: provides user allocation interface via /dev/ioasid

This patchset only includes the VT-d driver as a user of some of the new APIs.
VFIO and KVM patches are coming up to fully utilize the APIs introduced here.

[1] https://lore.kernel.org/linux-iommu/1599734733-6431-1-git-send-email-yi.l.liu@intel.com/
[2] Note that ioasid quota management code can be removed once the IOASIDs
cgroup is ratified.

You can find this series, VFIO, KVM, and IOASID user at:
https://github.com/jacobpan/linux.git ioasid_v4
(VFIO and KVM patches will be available at this branch when published.)

This work is a result of collaboration with many people:
Liu, Yi L <yi.l.liu@intel.com>
Wu Hao <hao.wu@intel.com>
Ashok Raj <ashok.raj@intel.com>
Kevin Tian <kevin.tian@intel.com>

Thanks,

Jacob

Changelog:

V4:
- Introduced IOASIDs cgroup controller
- Introduced /dev/ioasid user API for allocation/free
- Added IOASID states and free function, aligned with the refcounting
  introduced by Jean in v5.11.
- Support iommu-sva-lib (will converge VT-d code afterward)
- Added a shared ordered workqueue for notification work that requires
  thread context. Streamlined notification framework among multiple IOASID
  users.
- Added ioasid_set helper functions for per-set operations

V3:
- Use consistent ioasid_set_ prefix for ioasid_set level APIs
- Make SPID and private detach/attach APIs symmetric
- Use the same ioasid_put semantics as Jean-Philippe's IOASID reference patch
- Removed the public ioasid_notify() function; notifications are now emitted
  by IOASID core as a result of certain IOASID APIs
- Partition into finer incremental patches
- Miscellaneous cleanup, locking, exception handling fixes based on v2 reviews

V2:
- Redesigned ioasid_set APIs, removed set ID
- Added set private ID (SPID) for guest PASID usage.
- Add per ioasid_set notification and priority support.
- Went back to using spinlocks and atomic notifications.
- Added async work in VT-d driver to perform teardown outside atomic context


Jacob Pan (17):
  docs: Document IO Address Space ID (IOASID) APIs
  iommu/ioasid: Rename ioasid_set_data()
  iommu/ioasid: Add a separate function for detach data
  iommu/ioasid: Support setting system-wide capacity
  iommu/ioasid: Redefine IOASID set and allocation APIs
  iommu/ioasid: Add free function and states
  iommu/ioasid: Add ioasid_set iterator helper functions
  iommu/ioasid: Introduce ioasid_set private ID
  iommu/ioasid: Introduce notification APIs
  iommu/ioasid: Support mm token type ioasid_set notifications
  iommu/ioasid: Add ownership check in guest bind
  iommu/vt-d: Remove mm reference for guest SVA
  iommu/ioasid: Add a workqueue for cleanup work
  iommu/vt-d: Listen to IOASID notifications
  cgroup: Introduce ioasids controller
  iommu/ioasid: Consult IOASIDs cgroup for allocation
  docs: cgroup-v1: Add IOASIDs controller

Liu Yi L (1):
  ioasid: Add /dev/ioasid for userspace

 Documentation/admin-guide/cgroup-v1/index.rst |   1 +
 .../admin-guide/cgroup-v1/ioasids.rst         | 107 ++
 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/ioasid.rst           | 510 +++++++++
 Documentation/userspace-api/index.rst         |   1 +
 Documentation/userspace-api/ioasid.rst        |  49 +
 drivers/iommu/Kconfig                         |   5 +
 drivers/iommu/Makefile                        |   1 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |   1 +
 drivers/iommu/intel/Kconfig                   |   1 +
 drivers/iommu/intel/iommu.c                   |  32 +-
 drivers/iommu/intel/pasid.h                   |   1 +
 drivers/iommu/intel/svm.c                     | 145 ++-
 drivers/iommu/ioasid.c                        | 983 +++++++++++++++++-
 drivers/iommu/ioasid_user.c                   | 297 ++++++
 drivers/iommu/iommu-sva-lib.c                 |  19 +-
 drivers/iommu/iommu.c                         |  16 +-
 include/linux/cgroup_subsys.h                 |   4 +
 include/linux/intel-iommu.h                   |   2 +
 include/linux/ioasid.h                        | 256 ++++-
 include/linux/miscdevice.h                    |   1 +
 include/uapi/linux/ioasid.h                   |  98 ++
 init/Kconfig                                  |   7 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/ioasids.c                       | 345 ++++++
 25 files changed, 2794 insertions(+), 90 deletions(-)
 create mode 100644 Documentation/admin-guide/cgroup-v1/ioasids.rst
 create mode 100644 Documentation/driver-api/ioasid.rst
 create mode 100644 Documentation/userspace-api/ioasid.rst
 create mode 100644 drivers/iommu/ioasid_user.c
 create mode 100644 include/uapi/linux/ioasid.h
 create mode 100644 kernel/cgroup/ioasids.c

-- 
2.25.1



* [PATCH V4 01/18] docs: Document IO Address Space ID (IOASID) APIs
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 02/18] iommu/ioasid: Rename ioasid_set_data() Jacob Pan
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan,
	linux-doc, Randy Dunlap

IOASID is used to identify address spaces that can be targeted by device
DMA. It is a system-wide resource that is essential to its many users.
This document is an attempt to help developers from all vendors navigate
the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
(SIOV) enabled platforms are the primary users of IOASID. Examples of
how SIOV components interact with the IOASID APIs are provided.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
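For reviewers who prefer code to diagrams, the four states and the
reference counting rules described in ioasid.rst can be modelled in user
space roughly as below. This is a sketch, not the implementation: the
model_* names are hypothetical, and the model counts only user references,
kept separate from the allocation itself, which is a simplification.

```c
#include <assert.h>

enum model_state { ST_FREE, ST_IDLE, ST_ACTIVE, ST_FREE_PENDING };

struct model_ioasid {
	enum model_state state;
	int users;	/* references held by users (IOMMU, KVM, drivers) */
};

/* ALLOC: FREE -> IDLE. Fails if the ID is already allocated. */
static int model_alloc(struct model_ioasid *id)
{
	if (id->state != ST_FREE)
		return -1;
	id->state = ST_IDLE;
	id->users = 0;
	return 0;
}

/*
 * A user takes a reference (bind, workqueue programming): -> ACTIVE.
 * Refused when the ID is not allocated or is already FREE_PENDING.
 */
static int model_get(struct model_ioasid *id)
{
	if (id->state != ST_IDLE && id->state != ST_ACTIVE)
		return -1;
	id->users++;
	id->state = ST_ACTIVE;
	return 0;
}

/*
 * A user drops its reference. The last put moves ACTIVE -> IDLE, or
 * completes a pending free with FREE_PENDING -> FREE.
 */
static void model_put(struct model_ioasid *id)
{
	assert(id->users > 0);
	if (--id->users > 0)
		return;
	id->state = (id->state == ST_FREE_PENDING) ? ST_FREE : ST_IDLE;
}

/*
 * FREE request: an idle ID is reclaimed at once, an active ID becomes
 * FREE_PENDING. Calling it again while pending changes nothing.
 */
static void model_free(struct model_ioasid *id)
{
	if (id->state == ST_IDLE)
		id->state = ST_FREE;
	else if (id->state == ST_ACTIVE)
		id->state = ST_FREE_PENDING;
}
```

In this model a free request on an ID with two users leaves it in
FREE_PENDING until both users have called model_put(), after which the ID
can be allocated again.
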
 Documentation/driver-api/index.rst  |   1 +
 Documentation/driver-api/ioasid.rst | 510 ++++++++++++++++++++++++++++
 2 files changed, 511 insertions(+)
 create mode 100644 Documentation/driver-api/ioasid.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 2456d0a97ed8..baeec308cf2c 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -37,6 +37,7 @@ available subsections can be seen below.
    pci/index
    spi
    i2c
+   ioasid
    ipmb
    ipmi
    i3c/index
diff --git a/Documentation/driver-api/ioasid.rst b/Documentation/driver-api/ioasid.rst
new file mode 100644
index 000000000000..f3ed5bf43fa6
--- /dev/null
+++ b/Documentation/driver-api/ioasid.rst
@@ -0,0 +1,510 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _ioasid:
+
+=====================
+ IO Address Space ID
+=====================
+
+IOASIDs are used to identify virtual address spaces that DMA requests can
+target. IOASID is a generic name for PCIe Process Address Space ID (PASID)
+or SubstreamID defined by ARM's SMMU.
+
+The primary use cases for IOASIDs are Shared Virtual Addressing (SVA) and
+IO Virtual Address (IOVA) when multiple address spaces per device are
+desired. Due to hardware architectural differences the requirements for
+IOASID management can vary in terms of namespace, state management, and
+virtualization usages.
+
+The IOASID subsystem consists of three components:
+
+- IOASID core: provides APIs for allocation, pool management,
+  notifications and refcounting.
+- IOASID user: provides user allocation interface via /dev/ioasid
+- IOASID cgroup controller: manages resource distribution.
+  (Documentation/admin-guide/cgroup-v1/ioasids.rst)
+
+This document covers the features supported by the IOASID core APIs.
+Vendor-specific use cases are also illustrated with Intel's VT-d
+based platforms as the first example. The terms PASID and IOASID are used
+interchangeably throughout this document.
+
+.. contents:: :local:
+
+Glossary
+========
+PASID - Process Address Space ID
+
+IOVA - IO Virtual Address
+
+IOASID - IO Address Space ID (generic term for PCIe PASID and
+SubstreamID in SMMU)
+
+SVA/SVM - Shared Virtual Addressing/Memory
+
+gSVA - Guest Shared Virtual Addressing
+
+gIOVA - Guest IO Virtual Addressing
+
+ENQCMD - Instruction to submit work to shared workqueues. Refer
+to "Intel X86 ISA for efficient workqueue submission" [1]
+
+DSA - Intel Data Streaming Accelerator [2]
+
+VDCM - Virtual Device Composition Module [3]
+
+SIOV - Intel Scalable IO Virtualization
+
+DWQ - Dedicated Work Queue
+
+SWQ - Shared Work Queue
+
+1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
+
+2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
+
+3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
+
+
+Key Concepts
+============
+
+IOASID Set
+----------
+An IOASID set is a group of IOASIDs allocated from the system-wide
+IOASID pool. Refer to section "IOASID Set Level APIs" for more details.
+
+IOASID set is particularly useful for guest SVA where each guest could
+have its own IOASID set for security and efficiency reasons.
+
+Guest IOASID
+------------------
+An IOASID used by the guest. It identifies a guest IOVA space or a guest VA
+space per guest process.
+
+Host IOASID
+-----------------
+IOASID used by the host either for bare metal SVA or as the backing of a
+guest IOASID.
+
+Bind/Unbind
+-----------
+Refers to the process where mappings among IOASID, page tables, and devices
+are established/demolished. This usually involves setting up an entry of
+the IOMMU's per device PASID table with a given PGD.
+
+IOASID Set Private ID (SPID)
+----------------------------
+Each IOASID set has a private namespace of SPIDs. An SPID maps to a
+single system-wide IOASID. Conversely, each IOASID may be associated
+with an alias ID, local to the IOASID set, named SPID.
+SPIDs can be used as guest IOASIDs where each guest could do
+IOASID allocation from its own pool/set and map them to host physical
+IOASIDs. SPIDs are particularly useful for supporting live migration
+where decoupling guest and host physical resources is necessary. Guest
+to Host PASID mapping can be torn down and re-established. Storing the
+mapping inside the kernel also provides lookup service.
+
+For example, two VMs can both allocate guest PASID/SPID #101 but map to
+different host PASIDs #201 and #202 respectively as shown in the
+diagram below.
+::
+
+ .------------------.    .------------------.
+ |   VM 1           |    |   VM 2           |
+ |                  |    |                  |
+ |------------------|    |------------------|
+ | GPASID/SPID 101  |    | GPASID/SPID 101  |
+ '------------------'    '------------------'     Guest
+ __________|______________________|____________________
+           |                      |               Host
+           v                      v
+ .------------------.    .------------------.
+ | Host IOASID 201  |    | Host IOASID 202  |
+ '------------------'    '------------------'
+ |   IOASID set 1   |    |   IOASID set 2   |
+ '------------------'    '------------------'
+
+Guest PASID is treated as IOASID set private ID (SPID) within an
+IOASID set. Mappings between guest and host IOASIDs are stored in the
+set for inquiry.
+
+Theory of Operation
+===================
+
+States
+------
+IOASID has four states as illustrated in the diagram below.
+::
+
+   BIND/UNBIND, WQ PROG/CLEAR⁴
+   -----------------------------.
+                                |
+   ALLOC/FREE                   |
+   ------------.                |
+               |                |
+   +-------+   v    +-------+   v     +----------+
+   | FREE  |<======>| IDLE¹ |<=======>| ACTIVE²  |
+   +-------+        +-------+         +----------+
+      ^                                    |
+      |           +---------------+        |
+      '===========| FREE PENDING³ |<======='
+                  +---------------+  ^
+   FREE                              |
+   ----------------------------------'
+   ¹ Allocated but not used
+   ² Used by device drivers, IOMMU, or CPU, each user holds a reference
+   ³ Waiting for all users to drop their refcount before returning IOASID
+     back to the pool
+   ⁴ Device drivers obtain refcount after programming workqueue with IOASID
+     and release the refcount after clearing the workqueue.
+     Similarly, the IOMMU driver can also get/put IOASID refcount during
+     bind/unbind.
+
+Notifications
+-------------
+Depending on the hardware architecture, an IOASID can be programmed into
+CPU, IOMMU, or devices for DMA related activity. The synchronization among them
+is based on event notifications, which follow a publisher-subscriber pattern.
+
+Events
+~~~~~~
+Notification events are pertinent to individual IOASIDs. They can be
+one of the following::
+
+ - ALLOC
+ - FREE
+ - BIND
+ - UNBIND
+
+Besides calling ioasid_notify() directly with explicit events, notifications
+can also be sent by the IOASID core as a by-product of calling the following
+APIs::
+
+ - ioasid_free()         /* emits IOASID_FREE */
+ - ioasid_detach_spid()  /* emits IOASID_UNBIND */
+ - ioasid_attach_spid()  /* emits IOASID_BIND */
+
+Ordering
+~~~~~~~~
+Ordering of notification events is supported by the IOASID core as the
+following (from high to low)::
+
+ - CPU
+ - IOMMU
+ - DEVICE
+
+Subscribers of IOASID events are responsible for registering their
+notification blocks according to the priorities.
+
+The above order applies to all events. For example, if an UNBIND event is
+issued when a guest IOASID is freed due to exceptions, all active DMA
+sources should be quiesced before tearing down other hardware contexts
+associated with the IOASID in the system. This is necessary to reduce
+the churn in handling faults. The notification order ensures that vCPU
+is stopped before IOMMU and devices. KVM x86 code registers notification
+block with priority IOASID_PRIO_CPU and VDCM code registers notification
+block with priority IOASID_PRIO_DEVICE, so the IOASID core ensures the CPU
+handlers are called before the DEVICE handlers.
+
+It is the caller's responsibility to avoid chained notifications in the
+atomic notification handlers, e.g. ioasid_detach_spid() cannot be called
+inside the IOASID_FREE atomic handlers due to spinlocks held by the
+caller of the notifier. However, ioasid_detach_spid() can be called from
+deferred work. See Atomicity section for details.
+
+Level Sensitivity
+~~~~~~~~~~~~~~~~~
+For each IOASID state transition, IOASID core ensures that there is
+only one notification sent. This resembles a level-triggered interrupt
+where a single interrupt is raised during a state transition.
+For example, if ioasid_free() is called twice by a user before the
+IOASID is reclaimed, IOASID core will only send out a single
+IOASID_NOTIFY_FREE event. Similarly, IOASID_NOTIFY_BIND/UNBIND
+events are only sent out once when an SPID is attached/detached.
+
+Scopes
+~~~~~~
+There are two types of notifiers in IOASID core: system-wide and
+ioasid_set-wide (one notifier chain per ioasid_set).
+
+The system-wide notifier caters to users that need to handle all the
+IOASIDs in the system, e.g. the IOMMU driver.
+
+The per-ioasid_set notifier can be used by VM specific components such as
+KVM. After all, each KVM instance only cares about IOASIDs within its
+own set/guest. The following flags are used to distinguish the scopes::
+
+ #define IOASID_NOTIFY_FLAG_ALL BIT(0)
+ #define IOASID_NOTIFY_FLAG_SET BIT(1)
+
+For example, on VT-d platform both KVM and VDCM shall register notifier
+block on the IOASID set such that *only* events from the matching VM
+are received.
+
+If KVM attempts to register a notifier block before the IOASID set is
+created using the MM token, the notifier block will be placed on a
+pending list inside IOASID core. Once the IOASID set matching the token
+is created, IOASID core will register the notifier block automatically.
+IOASID core does not replay events for the existing IOASIDs in the
+set. For IOASID set of MM type, notification blocks can be registered
+on empty sets only. This is to avoid lost events.
+
+IOMMU driver shall register notifier block on global chain, e.g. ::
+
+ static struct notifier_block pasid_nb_vtd = {
+	.notifier_call = pasid_status_change_vtd,
+	.priority      = IOASID_PRIO_IOMMU,
+ };
+
+Atomicity
+~~~~~~~~~
+IOASID notifiers are atomic due to spinlocks used inside the IOASID
+core. For tasks that cannot be completed in the notifier handler,
+async work to be completed in order must be submitted to the ordered
+workqueue provided by the IOASID core. This will ensure the order w.r.t.
+the work items submitted by other users of the same event.
+
+It is the caller's responsibility to avoid chained notifications in the
+atomic notification handlers, e.g. ioasid_detach_spid() cannot be called
+inside the IOASID_FREE atomic handlers due to spinlocks held by the
+caller of the notifier. However, ioasid_detach_spid() can be called from
+deferred work.
+
+Reference counting
+------------------
+IOASID life cycle management is based on reference counting. Users of
+IOASID who intend to align its context with the life cycle need to hold
+references of the IOASID. An IOASID will not be returned to the pool
+for re-allocation until all its references are dropped. Calling ioasid_free()
+will mark the IOASID as FREE_PENDING if the IOASID has outstanding
+references. No new references can be taken by ioasid_get() once an
+IOASID is in the FREE_PENDING state. ioasid_free() can be called
+multiple times without an error until all refs are dropped.
+
+ioasid_put() decrements and tests refcount of the IOASID. If refcount
+is 0, ioasid will be freed. The IOASID will be returned to the pool and
+available for new allocations. Note that ioasid_put() can be called by
+the IOASID_FREE event handler where the subscriber can drop the last
+refcount that ends the free pending state.
+
+Event notifications are used to inform users of IOASID status change.
+IOASID_FREE or UNBIND events prompt users to drop their references after
+clearing their context.
+
+For example, on VT-d platform when an IOASID is freed, teardown
+actions are performed on CPU (KVM), device driver (VDCM), and the IOMMU
+driver. To quiesce vCPU for work submission, KVM notifier handler must
+be called before VDCM handler. Therefore, KVM and VDCM shall monitor
+notification events IOASID_UNBIND.
+
+Namespaces
+----------
+IOASIDs are limited system resources that default to 20 bits in
+size. Each device can have its own PASID table for security reasons.
+Theoretically the namespace can be per device also.
+
+However IOASID namespace is system-wide for two reasons:
+- Simplicity
+- Sharing resources of a single device to multiple VMs.
+
+Taking VT-d as an example, it supports shared workqueue and ENQCMD[1]
+where one IOASID could be used to submit work on multiple devices that
+are shared with other VMs. This requires IOASID to be
+system-wide. This is also the reason why guests must use an
+emulated virtual command interface to allocate IOASID from the host.
+
+Life cycle
+----------
+This section covers the IOASID life cycle management for both bare-metal
+and guest usages. In bare-metal SVA, MMU notifier is directly hooked
+up with the IOMMU driver. By leveraging the .release() function, the
+IOASID life cycle can be made to match the process address space (MM)
+life cycle.
+
+However, guest MMU notifier is not available to the host IOMMU driver.
+When guest MM terminates unexpectedly, the events have to go through
+VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
+parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
+are used by IOMMU driver, KVM, VDCM, and VFIO.
+
+At a high level, there are the following four patterns:
+
+1.   ALLOC -> FREE
+2.   ALLOC -> BIND -> DMA Activity -> UNBIND -> FREE
+3.   ALLOC -> BIND -> FREE
+4.   ALLOC -> BIND -> DMA Activity -> FREE
+
+The first two are normal cases; 3 and 4 are exceptions due to a
+misbehaving user process.
+
+Exception handling can be complex when there are lots of IOASID
+consumers involved but the pattern is common and quite simple. When an
+IOASID in active state is being freed, IOASID core will notify all
+users to perform cleanup. Each IOASID user performs cleanup and drops
+the reference at the end. When reference count drops to 0, IOASID will
+be reclaimed and ready to be allocated again.
+
+Cleanup can be either done in the atomic notifier handler or as queued
+work to the common ordered IOASID workqueue to be performed asynchronously.
+The high-level flow is the following::
+
+  Free Req¹ -> Notify users -> Cleanup -> Drop reference -> Reclaim
+
+Notes:
+¹ Free one IOASID or free all IOASIDs within a set
+
+The following table shows how events are used on Intel VT-d platform.
+::
+
+  --------------------------------------------------------------------------
+  Events     |Publishers       | Subscribers
+  -----------+-----------------+--------------------------------------------
+  ALLOC      |/dev/ioasid      | None
+  -----------+-----------------+--------------------------------------------
+  FREE       |/dev/ioasid      | IOMMU (VT-d driver)¹
+  -----------+-----------------+--------------------------------------------
+  BIND       |IOMMU            | KVM, VDCM
+  -----------+-----------------+--------------------------------------------
+  UNBIND     |IOMMU²           | KVM, VDCM
+  -----------+-----------------+--------------------------------------------
+
+  ¹ IOASID core issues FREE events if the IOASID is in the ACTIVE state. IOMMU
+    driver calls ioasid_detach_spid() which issues UNBIND event outside atomic
+    notifier handler.
+  ² Only *one* BIND/UNBIND event is issued per bind/unbind cycle. For multiple
+    devices bound to the same PASID, BIND event is issued for the first device
+    bind, UNBIND event is issued for the last device unbind. Faults must be
+    tolerated between the first and last device unbind. Under normal
+    circumstances, faults are not expected in that the teardown process shall
+    stop DMA activities prior to unbind.
+
+The number of IOASIDs allocated in the ioasid_set serves as the refcount
+of the set. This ensures the life cycle alignment of the set and its
+IOASIDs.
+
+API Implementation
+==================
+To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
+serve the following functionalities:
+
+  - IOASID allocation/freeing
+  - Group management in the form of ioasid_set
+  - Private data storage and lookup
+  - Reference counting
+  - Event notification in case of a state change
+
+Custom allocator APIs
+---------------------
+
+IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
+allocators can be different. For example, on VT-d guest PASID
+allocation must be performed via a virtual command interface which is
+emulated by VMM.
+
+IOASID core has the notion of "custom allocator" such that guest can
+register virtual command allocator that precedes the default one.
+::
+
+ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
+
+ void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
+
+IOASID Set Level APIs
+---------------------
+For use cases such as guest SVA it is necessary to manage IOASIDs at
+ioasid_set level. For example, VMs may allocate multiple IOASIDs for
+guest process address sharing (vSVA). It is imperative to enforce
+VM-IOASID ownership such that a malicious guest cannot target DMA
+traffic outside its own IOASIDs, or free an active IOASID that belongs
+to another VM.
+
+The IOASID set APIs serve the following purposes:
+
+ - Ownership/permission enforcement
+ - Take collective actions, e.g. free an entire set
+ - Event notifications within a set
+ - Look up a set based on token
+ - Quota enforcement (TBD, contingent upon ioasids cgroup)
+
+Each IOASID set is created with a token, which can be one of the
+following token types::
+
+ - IOASID_SET_TYPE_NONE (Arbitrary u64 value)
+ - IOASID_SET_TYPE_MM (Set token is a mm_struct)
+
+The explicit MM token type is useful when multiple users of an IOASID
+set under the same process need to communicate about their shared IOASIDs.
+E.g. An IOASID set created by VFIO for one guest can be associated
+with the KVM instance for the same guest since they share a common mm_struct.
+A token must be unique within its type.
+
+::
+
+ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
+
+ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
+                                void (*fn)(ioasid_t id, void *data),
+                                void *data)
+
+ struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
+
+ void ioasid_free_all_in_set(struct ioasid_set *set)
+
+Individual IOASID APIs
+----------------------
+Once an ioasid_set is created, IOASIDs can be allocated from the set.
+Within the IOASID set namespace, set private ID (SPID) is supported. In
+the VM use case, SPID can be used for storing guest PASID.
+
+::
+
+ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
+                       void *private);
+
+ int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
+
+ void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
+
+ int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+
+ void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
+
+ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
+                   bool (*getter)(void *));
+
+ ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid,
+                              bool get);
+
+ int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
+                        void *data);
+ int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
+                        ioasid_t spid);
+
+
+Notification APIs
+-----------------
+An IOASID may have multiple users, each user may have hardware context
+associated with an IOASID. When the status of an IOASID changes,
+e.g. an IOASID is being freed, users need to be notified such that the
+associated hardware context can be cleared, flushed, and drained.
+
+::
+
+ int ioasid_register_notifier(struct ioasid_set *set, struct
+                              notifier_block *nb)
+
+ void ioasid_unregister_notifier(struct ioasid_set *set,
+                                 struct notifier_block *nb)
+
+ int ioasid_register_notifier_mm(struct mm_struct *mm, struct
+                                 notifier_block *nb)
+
+ void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
+                                    notifier_block *nb)
+
+ int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
+                   unsigned int flags)
+
+The "_mm" flavors of the ioasid_register_notifier() APIs are used when
+an IOASID user needs to listen to the IOASID events belonging to a
+process but without the knowledge of the associated ioasid_set.
-- 
2.25.1



* [PATCH V4 02/18] iommu/ioasid: Rename ioasid_set_data()
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 01/18] docs: Document IO Address Space ID (IOASID) APIs Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 03/18] iommu/ioasid: Add a separate function for detach data Jacob Pan
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

Rename ioasid_set_data() to ioasid_attach_data() to avoid confusion with
struct ioasid_set. ioasid_set is a group of IOASIDs that share a common
token.

Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
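The attach/lookup semantics that the renamed function is meant to convey
can be sketched in user space as follows. This is a model under my own
assumptions, not the kernel code: the model_* names and the fixed-size
table are hypothetical, and returning -ENOENT for an ID that was never
allocated is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_IDS 16	/* arbitrary size for this sketch */

struct model_entry {
	int allocated;
	void *priv;		/* private data attached to the ID */
};

static struct model_entry model_table[MODEL_MAX_IDS];

/* Mark an ID as allocated so that data can be attached to it. */
static int model_mark_allocated(unsigned int id)
{
	if (id >= MODEL_MAX_IDS)
		return -EINVAL;
	model_table[id].allocated = 1;
	model_table[id].priv = NULL;
	return 0;
}

/*
 * Attach private data to an already allocated ID. Passing NULL clears
 * the data, mirroring the call sites touched by this patch.
 */
static int model_attach_data(unsigned int id, void *data)
{
	if (id >= MODEL_MAX_IDS || !model_table[id].allocated)
		return -ENOENT;	/* assumed error code, see above */
	model_table[id].priv = data;
	return 0;
}

/* Return the private data of an ID, or NULL if there is none. */
static void *model_find(unsigned int id)
{
	if (id >= MODEL_MAX_IDS || !model_table[id].allocated)
		return NULL;
	return model_table[id].priv;
}
```

The point of the rename is visible here: the operation attaches data to a
single ID and has nothing to do with the ioasid_set grouping.
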
 drivers/iommu/intel/svm.c | 6 +++---
 drivers/iommu/ioasid.c    | 6 +++---
 include/linux/ioasid.h    | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 18a9f05df407..0053df9edffc 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -371,7 +371,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 			svm->gpasid = data->gpasid;
 			svm->flags |= SVM_FLAG_GUEST_PASID;
 		}
-		ioasid_set_data(data->hpasid, svm);
+		ioasid_attach_data(data->hpasid, svm);
 		INIT_LIST_HEAD_RCU(&svm->devs);
 		mmput(svm->mm);
 	}
@@ -425,7 +425,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	list_add_rcu(&sdev->list, &svm->devs);
  out:
 	if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
-		ioasid_set_data(data->hpasid, NULL);
+		ioasid_attach_data(data->hpasid, NULL);
 		kfree(svm);
 	}
 
@@ -468,7 +468,7 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
 				 * the unbind, IOMMU driver will get notified
 				 * and perform cleanup.
 				 */
-				ioasid_set_data(pasid, NULL);
+				ioasid_attach_data(pasid, NULL);
 				kfree(svm);
 			}
 		}
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 50ee27bbd04e..eeadf4586e0a 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -259,14 +259,14 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
 EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
 
 /**
- * ioasid_set_data - Set private data for an allocated ioasid
+ * ioasid_attach_data - Set private data for an allocated ioasid
  * @ioasid: the ID to set data
  * @data:   the private data
  *
  * For IOASID that is already allocated, private data can be set
  * via this API. Future lookup can be done via ioasid_find.
  */
-int ioasid_set_data(ioasid_t ioasid, void *data)
+int ioasid_attach_data(ioasid_t ioasid, void *data)
 {
 	struct ioasid_data *ioasid_data;
 	int ret = 0;
@@ -288,7 +288,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(ioasid_set_data);
+EXPORT_SYMBOL_GPL(ioasid_attach_data);
 
 /**
  * ioasid_alloc - Allocate an IOASID
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index e9dacd4b9f6b..60ea279802b8 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -40,7 +40,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 		  bool (*getter)(void *));
 int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
 void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
-int ioasid_set_data(ioasid_t ioasid, void *data);
+int ioasid_attach_data(ioasid_t ioasid, void *data);
 
 #else /* !CONFIG_IOASID */
 static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
@@ -73,7 +73,7 @@ static inline void ioasid_unregister_allocator(struct ioasid_allocator_ops *allo
 {
 }
 
-static inline int ioasid_set_data(ioasid_t ioasid, void *data)
+static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
 {
 	return -ENOTSUPP;
 }
-- 
2.25.1



* [PATCH V4 03/18] iommu/ioasid: Add a separate function for detach data
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 01/18] docs: Document IO Address Space ID (IOASID) APIs Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 02/18] iommu/ioasid: Rename ioasid_set_data() Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 04/18] iommu/ioasid: Support setting system-wide capacity Jacob Pan
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

IOASID private data can be cleared by calling ioasid_attach_data() with a
NULL data pointer. A common use case is for the caller to free the data
afterward. ioasid_attach_data() calls synchronize_rcu() before returning
so that the data can be freed with no outstanding readers. However, since
synchronize_rcu() may sleep, ioasid_attach_data() cannot be used under
spinlocks.

This patch adds ioasid_detach_data() as a separate API, so that
synchronize_rcu() is called only on detach. ioasid_attach_data() can
then be used under spinlocks. In addition, this change makes the API
symmetrical.
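
For illustration only, the attach/detach rules in the diff below can be
modeled in plain userspace C. This is a sketch, not kernel code: the
model_* names and struct ioasid_entry_model are hypothetical, and the
RCU grace period is only described in a comment, since the model has no
concurrent readers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one IOASID entry. */
struct ioasid_entry_model {
	int allocated;	/* nonzero once the ID exists in the allocator */
	void *private;	/* caller-owned data, NULL when nothing is attached */
};

/*
 * Attach rules as in the patch: -ENOENT for an unknown ID, -EBUSY if
 * data is already attached, 0 on success. Nothing here can sleep.
 */
int model_attach_data(struct ioasid_entry_model *e, void *data)
{
	assert(e != NULL);
	if (!e->allocated)
		return -ENOENT;
	if (e->private)
		return -EBUSY;
	e->private = data;
	return 0;
}

/*
 * Detach clears the pointer. The kernel function then waits in
 * synchronize_rcu() so the caller can free the old data; this model
 * has no readers, so there is nothing to wait for.
 */
void model_detach_data(struct ioasid_entry_model *e)
{
	assert(e != NULL);
	if (!e->allocated)
		return;
	e->private = NULL;
}
```

With this split, a caller that wants to replace attached data has to
detach first, since a second attach is rejected with -EBUSY.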

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/svm.c |  4 +--
 drivers/iommu/ioasid.c    | 54 +++++++++++++++++++++++++++++++--------
 include/linux/ioasid.h    |  5 +++-
 3 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 0053df9edffc..68372a7eb8b5 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -425,7 +425,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	list_add_rcu(&sdev->list, &svm->devs);
  out:
 	if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
-		ioasid_attach_data(data->hpasid, NULL);
+		ioasid_detach_data(data->hpasid);
 		kfree(svm);
 	}
 
@@ -468,7 +468,7 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
 				 * the unbind, IOMMU driver will get notified
 				 * and perform cleanup.
 				 */
-				ioasid_attach_data(pasid, NULL);
+				ioasid_detach_data(pasid);
 				kfree(svm);
 			}
 		}
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index eeadf4586e0a..4eb9b3dd1b85 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -273,23 +273,57 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
 
 	spin_lock(&ioasid_allocator_lock);
 	ioasid_data = xa_load(&active_allocator->xa, ioasid);
-	if (ioasid_data)
-		rcu_assign_pointer(ioasid_data->private, data);
-	else
+
+	if (!ioasid_data) {
 		ret = -ENOENT;
-	spin_unlock(&ioasid_allocator_lock);
+		goto done_unlock;
+	}
 
-	/*
-	 * Wait for readers to stop accessing the old private data, so the
-	 * caller can free it.
-	 */
-	if (!ret)
-		synchronize_rcu();
+	if (ioasid_data->private) {
+		ret = -EBUSY;
+		goto done_unlock;
+	}
+	rcu_assign_pointer(ioasid_data->private, data);
+
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(ioasid_attach_data);
 
+/**
+ * ioasid_detach_data - Clear the private data of an ioasid
+ *
+ * @ioasid: the IOASID to clear private data
+ */
+void ioasid_detach_data(ioasid_t ioasid)
+{
+	struct ioasid_data *ioasid_data;
+
+	spin_lock(&ioasid_allocator_lock);
+	ioasid_data = xa_load(&active_allocator->xa, ioasid);
+
+	if (!ioasid_data) {
+		pr_warn("IOASID %u not found to detach data from\n", ioasid);
+		goto done_unlock;
+	}
+
+	if (ioasid_data->private) {
+		rcu_assign_pointer(ioasid_data->private, NULL);
+		goto done_unlock;
+	}
+
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+	/*
+	 * Wait for readers to stop accessing the old private data,
+	 * so the caller can free it.
+	 */
+	synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(ioasid_detach_data);
+
 /**
  * ioasid_alloc - Allocate an IOASID
  * @set: the IOASID set
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 60ea279802b8..f6e705f832f0 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -41,7 +41,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
 void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
 int ioasid_attach_data(ioasid_t ioasid, void *data);
-
+void ioasid_detach_data(ioasid_t ioasid);
 #else /* !CONFIG_IOASID */
 static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
 				    ioasid_t max, void *private)
@@ -78,5 +78,8 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
 	return -ENOTSUPP;
 }
 
+static inline void ioasid_detach_data(ioasid_t ioasid)
+{
+}
 #endif /* CONFIG_IOASID */
 #endif /* __LINUX_IOASID_H */
-- 
2.25.1



* [PATCH V4 04/18] iommu/ioasid: Support setting system-wide capacity
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (2 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 03/18] iommu/ioasid: Add a separate function for detach data Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs Jacob Pan
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

IOASID is a system-wide resource whose capacity can vary between systems.
The default capacity is 20 bits, as defined in the PCIe specification.
This patch adds a function to allow adjusting the system IOASID capacity.
For VT-d, this is set during boot as part of the Intel IOMMU
initialization. APIs are also added to support runtime capacity reservation,
potentially by cgroups.
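
As a rough illustration, the capacity accounting in the diff below can
be modeled in userspace C. The model_* names are hypothetical and there
is no locking; the arithmetic follows the patch. Note that, as posted,
ioasid_reserve_capacity() returns the number of IDs reserved, while its
comment describes the return value as the remaining capacity. The
sketch follows the code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_PASID_MAX 0x100000u	/* 20-bit default, as in the patch */

static uint32_t model_capacity = MODEL_PASID_MAX;
static uint32_t model_capacity_avail = MODEL_PASID_MAX;

/* Only an install over the default value takes effect. */
void model_install_capacity(uint32_t total)
{
	if (model_capacity && model_capacity != MODEL_PASID_MAX)
		return;
	model_capacity = model_capacity_avail = total;
}

/* Reserve nr IDs, or everything that is left when nr is 0. */
int model_reserve_capacity(uint32_t nr)
{
	if (nr > model_capacity_avail)
		return -ENOSPC;
	if (!nr)
		nr = model_capacity_avail;
	model_capacity_avail -= nr;
	return (int)nr;
}

/* Return nr IDs to the pool; reject returns that exceed the total. */
int model_cancel_capacity(uint32_t nr)
{
	if (nr + model_capacity_avail > model_capacity)
		return -EINVAL;
	model_capacity_avail += nr;
	assert(model_capacity_avail <= model_capacity);
	return (int)model_capacity_avail;
}
```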

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/iommu.c |  5 +++
 drivers/iommu/ioasid.c      | 70 +++++++++++++++++++++++++++++++++++++
 include/linux/ioasid.h      | 18 ++++++++++
 3 files changed, 93 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f665322a0991..6f42ff7d171d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -41,6 +41,7 @@
 #include <linux/dma-direct.h>
 #include <linux/crash_dump.h>
 #include <linux/numa.h>
+#include <linux/ioasid.h>
 #include <asm/irq_remapping.h>
 #include <asm/cacheflush.h>
 #include <asm/iommu.h>
@@ -3298,6 +3299,10 @@ static int __init init_dmars(void)
 	if (ret)
 		goto free_iommu;
 
+	/* PASID is needed for scalable mode irrespective to SVM */
+	if (intel_iommu_sm)
+		ioasid_install_capacity(intel_pasid_max_id);
+
 	/*
 	 * for each drhd
 	 *   enable fault log
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 4eb9b3dd1b85..28681b99340b 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -10,6 +10,10 @@
 #include <linux/spinlock.h>
 #include <linux/xarray.h>
 
+/* Default to PCIe standard 20 bit PASID */
+#define PCI_PASID_MAX 0x100000
+static ioasid_t ioasid_capacity = PCI_PASID_MAX;
+static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
 struct ioasid_data {
 	ioasid_t id;
 	struct ioasid_set *set;
@@ -258,6 +262,72 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
 }
 EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
 
+void ioasid_install_capacity(ioasid_t total)
+{
+	spin_lock(&ioasid_allocator_lock);
+	if (ioasid_capacity && ioasid_capacity != PCI_PASID_MAX) {
+		pr_warn("IOASID capacity is already set.\n");
+		goto done_unlock;
+	}
+	ioasid_capacity = ioasid_capacity_avail = total;
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_install_capacity);
+
+/**
+ * @brief Reserve capacity from the system pool
+ *
+ * @param nr_ioasid Number of IOASIDs requested to be reserved, 0 means
+ *	reserve all remaining IDs.
+ *
+ * @return the remaining capacity on success, or errno
+ */
+int ioasid_reserve_capacity(ioasid_t nr_ioasid)
+{
+	int ret = 0;
+
+	spin_lock(&ioasid_allocator_lock);
+	if (nr_ioasid > ioasid_capacity_avail) {
+		ret = -ENOSPC;
+		goto done_unlock;
+	}
+	if (!nr_ioasid)
+		nr_ioasid = ioasid_capacity_avail;
+	ioasid_capacity_avail -= nr_ioasid;
+	ret = nr_ioasid;
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_reserve_capacity);
+
+/**
+ * @brief Return capacity to the system pool
+ * 	We trust the caller not to return more than it has reserved, we could
+ * 	also track reservation if needed.
+ *
+ * @param nr_ioasid Number of IOASIDs requested to be returned
+ *
+ * @return the remaining capacity on success, or errno
+ */
+int ioasid_cancel_capacity(ioasid_t nr_ioasid)
+{
+	int ret = 0;
+
+	spin_lock(&ioasid_allocator_lock);
+	if (nr_ioasid + ioasid_capacity_avail > ioasid_capacity) {
+		ret = -EINVAL;
+		goto done_unlock;
+	}
+	ioasid_capacity_avail += nr_ioasid;
+	ret = ioasid_capacity_avail;
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_cancel_capacity);
+
 /**
  * ioasid_attach_data - Set private data for an allocated ioasid
  * @ioasid: the ID to set data
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index f6e705f832f0..2780bdc84b94 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -32,6 +32,10 @@ struct ioasid_allocator_ops {
 #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
 
 #if IS_ENABLED(CONFIG_IOASID)
+void ioasid_install_capacity(ioasid_t total);
+int ioasid_reserve_capacity(ioasid_t nr_ioasid);
+int ioasid_cancel_capacity(ioasid_t nr_ioasid);
+
 ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		      void *private);
 void ioasid_get(ioasid_t ioasid);
@@ -43,6 +47,20 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
 int ioasid_attach_data(ioasid_t ioasid, void *data);
 void ioasid_detach_data(ioasid_t ioasid);
 #else /* !CONFIG_IOASID */
+static inline void ioasid_install_capacity(ioasid_t total)
+{
+}
+
+static inline int ioasid_reserve_capacity(ioasid_t nr_ioasid)
+{
+	return -ENOSPC;
+}
+
+static inline int ioasid_cancel_capacity(ioasid_t nr_ioasid)
+{
+	return -EINVAL;
+}
+
 static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
 				    ioasid_t max, void *private)
 {
-- 
2.25.1



* [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (3 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 04/18] iommu/ioasid: Support setting system-wide capacity Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-03-19  0:22   ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 06/18] iommu/ioasid: Add free function and states Jacob Pan
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

ioasid_set was introduced as an arbitrary token that is shared by a
group of IOASIDs. For example, two IOASIDs allocated via the same
ioasid_set pointer belong to the same set.

For guest SVA usages, system-wide IOASID resources need to be
partitioned such that each VM can have its own quota and be managed
separately. ioasid_set is the perfect candidate for meeting such
requirements. This patch redefines and extends ioasid_set with the
following new fields:
- Quota
- Reference count
- Storage of its namespace
- The token, now stored in the ioasid_set together with its type

Basic ioasid_set-level APIs are introduced to wire up these new fields.
Existing users of the IOASID APIs are converted so that a host IOASID set
is allocated for bare-metal usage. This includes the VT-d driver and
iommu-sva-lib.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |   1 +
 drivers/iommu/intel/iommu.c                   |  27 +-
 drivers/iommu/intel/pasid.h                   |   1 +
 drivers/iommu/intel/svm.c                     |  25 +-
 drivers/iommu/ioasid.c                        | 288 +++++++++++++++---
 drivers/iommu/iommu-sva-lib.c                 |  19 +-
 include/linux/ioasid.h                        |  68 ++++-
 7 files changed, 361 insertions(+), 68 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index e13b092e6004..588aa66ed5e4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -459,6 +459,7 @@ int arm_smmu_master_enable_sva(struct arm_smmu_master *master)
 {
 	mutex_lock(&sva_lock);
 	master->sva_enabled = true;
+	iommu_sva_init();
 	mutex_unlock(&sva_lock);
 
 	return 0;
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 6f42ff7d171d..eb9868061545 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -103,6 +103,9 @@
  */
 #define INTEL_IOMMU_PGSIZES	(~0xFFFUL)
 
+/* PASIDs used by host SVM */
+struct ioasid_set *host_pasid_set;
+
 static inline int agaw_to_level(int agaw)
 {
 	return agaw + 2;
@@ -173,6 +176,7 @@ static struct intel_iommu **g_iommus;
 
 static void __init check_tylersburg_isoch(void);
 static int rwbf_quirk;
+static bool scalable_mode_support(void);
 
 /*
  * set to 1 to panic kernel if can't successfully enable VT-d
@@ -3114,8 +3118,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
 	 * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
 	 * We can only free the PASID when all the devices are unbound.
 	 */
-	if (ioasid_find(NULL, ioasid, NULL)) {
-		pr_alert("Cannot free active IOASID %d\n", ioasid);
+	if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
+		pr_err("IOASID %d to be freed but not in system set\n", ioasid);
 		return;
 	}
 	vcmd_free_pasid(iommu, ioasid);
@@ -3300,8 +3304,17 @@ static int __init init_dmars(void)
 		goto free_iommu;
 
 	/* PASID is needed for scalable mode irrespective to SVM */
-	if (intel_iommu_sm)
+	if (scalable_mode_support()) {
 		ioasid_install_capacity(intel_pasid_max_id);
+		/* We should not run out of IOASIDs at boot */
+		host_pasid_set = ioasid_set_alloc(NULL, PID_MAX_DEFAULT,
+						  IOASID_SET_TYPE_NULL);
+		if (IS_ERR_OR_NULL(host_pasid_set)) {
+			pr_err("Failed to allocate host PASID set %lu\n",
+				PTR_ERR(host_pasid_set));
+			intel_iommu_sm = 0;
+		}
+	}
 
 	/*
 	 * for each drhd
@@ -3348,7 +3361,7 @@ static int __init init_dmars(void)
 		disable_dmar_iommu(iommu);
 		free_dmar_iommu(iommu);
 	}
-
+	ioasid_set_free(host_pasid_set);
 	kfree(g_iommus);
 
 error:
@@ -4573,7 +4586,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
 		u32 pasid;
 
 		/* No private data needed for the default pasid */
-		pasid = ioasid_alloc(NULL, PASID_MIN,
+		pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
 				     pci_max_pasids(to_pci_dev(dev)) - 1,
 				     NULL);
 		if (pasid == INVALID_IOASID) {
@@ -4630,7 +4643,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
 link_failed:
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 	if (list_empty(&domain->subdevices) && domain->default_pasid > 0)
-		ioasid_put(domain->default_pasid);
+		ioasid_put(host_pasid_set, domain->default_pasid);
 
 	return ret;
 }
@@ -4660,7 +4673,7 @@ static void aux_domain_remove_dev(struct dmar_domain *domain,
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	if (list_empty(&domain->subdevices) && domain->default_pasid > 0)
-		ioasid_put(domain->default_pasid);
+		ioasid_put(host_pasid_set, domain->default_pasid);
 }
 
 static int prepare_domain_attach_device(struct iommu_domain *domain,
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 97dfcffbf495..12b5ca18de5d 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
 }
 
 extern unsigned int intel_pasid_max_id;
+extern struct ioasid_set *host_pasid_set;
 int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
 void intel_pasid_free_id(u32 pasid);
 void *intel_pasid_lookup_id(u32 pasid);
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 68372a7eb8b5..c469c24d23f5 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -247,7 +247,9 @@ static LIST_HEAD(global_svm_list);
 	list_for_each_entry((sdev), &(svm)->devs, list)	\
 		if ((d) != (sdev)->dev) {} else
 
-static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
+static int pasid_to_svm_sdev(struct device *dev,
+			     struct ioasid_set *set,
+			     unsigned int pasid,
 			     struct intel_svm **rsvm,
 			     struct intel_svm_dev **rsdev)
 {
@@ -261,7 +263,7 @@ static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
 	if (pasid == INVALID_IOASID || pasid >= PASID_MAX)
 		return -EINVAL;
 
-	svm = ioasid_find(NULL, pasid, NULL);
+	svm = ioasid_find(set, pasid, NULL);
 	if (IS_ERR(svm))
 		return PTR_ERR(svm);
 
@@ -337,7 +339,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	dmar_domain = to_dmar_domain(domain);
 
 	mutex_lock(&pasid_mutex);
-	ret = pasid_to_svm_sdev(dev, data->hpasid, &svm, &sdev);
+	ret = pasid_to_svm_sdev(dev, NULL,
+				data->hpasid, &svm, &sdev);
 	if (ret)
 		goto out;
 
@@ -444,7 +447,7 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
 		return -EINVAL;
 
 	mutex_lock(&pasid_mutex);
-	ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
+	ret = pasid_to_svm_sdev(dev, NULL, pasid, &svm, &sdev);
 	if (ret)
 		goto out;
 
@@ -602,7 +605,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 			pasid_max = intel_pasid_max_id;
 
 		/* Do not use PASID 0, reserved for RID to PASID */
-		svm->pasid = ioasid_alloc(NULL, PASID_MIN,
+		svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
 					  pasid_max - 1, svm);
 		if (svm->pasid == INVALID_IOASID) {
 			kfree(svm);
@@ -619,7 +622,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 		if (mm) {
 			ret = mmu_notifier_register(&svm->notifier, mm);
 			if (ret) {
-				ioasid_put(svm->pasid);
+				ioasid_put(host_pasid_set, svm->pasid);
 				kfree(svm);
 				kfree(sdev);
 				goto out;
@@ -637,7 +640,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 		if (ret) {
 			if (mm)
 				mmu_notifier_unregister(&svm->notifier, mm);
-			ioasid_put(svm->pasid);
+			ioasid_put(host_pasid_set, svm->pasid);
 			kfree(svm);
 			kfree(sdev);
 			goto out;
@@ -689,7 +692,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
 	if (!iommu)
 		goto out;
 
-	ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
+	ret = pasid_to_svm_sdev(dev, host_pasid_set,
+				pasid, &svm, &sdev);
 	if (ret)
 		goto out;
 
@@ -710,7 +714,7 @@ static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
 			kfree_rcu(sdev, rcu);
 
 			if (list_empty(&svm->devs)) {
-				ioasid_put(svm->pasid);
+				ioasid_put(host_pasid_set, svm->pasid);
 				if (svm->mm) {
 					mmu_notifier_unregister(&svm->notifier, svm->mm);
 					/* Clear mm's pasid. */
@@ -1184,7 +1188,8 @@ int intel_svm_page_response(struct device *dev,
 		goto out;
 	}
 
-	ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
+	ret = pasid_to_svm_sdev(dev, host_pasid_set,
+				prm->pasid, &svm, &sdev);
 	if (ret || !sdev) {
 		ret = -ENODEV;
 		goto out;
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 28681b99340b..d7b476651027 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -1,8 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
  * I/O Address Space ID allocator. There is one global IOASID space, split into
- * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
- * free IOASIDs with ioasid_alloc and ioasid_put.
+ * sets. Users create a set with ioasid_set_alloc, then allocate/free IDs
+ * with ioasid_alloc, ioasid_put, and ioasid_free.
  */
 #include <linux/ioasid.h>
 #include <linux/module.h>
@@ -14,6 +14,7 @@
 #define PCI_PASID_MAX 0x100000
 static ioasid_t ioasid_capacity = PCI_PASID_MAX;
 static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
+static DEFINE_XARRAY_ALLOC(ioasid_sets);
 struct ioasid_data {
 	ioasid_t id;
 	struct ioasid_set *set;
@@ -394,6 +395,151 @@ void ioasid_detach_data(ioasid_t ioasid)
 }
 EXPORT_SYMBOL_GPL(ioasid_detach_data);
 
+static inline bool ioasid_set_is_valid(struct ioasid_set *set)
+{
+	return xa_load(&ioasid_sets, set->id) == set;
+}
+
+/**
+ * ioasid_set_alloc - Allocate a new IOASID set for a given token
+ *
+ * @token:	An optional arbitrary number that can be associated with the
+ *		IOASID set. @token can be NULL if the type is
+ *		IOASID_SET_TYPE_NULL
+ * @quota:	Quota allowed in this set, 0 indicates no limit for the set
+ * @type:	The type of the token used to create the IOASID set
+ *
+ * IOASID is a limited system-wide resource that requires quota management.
+ * Token will be stored in the ioasid_set returned. A reference will be taken
+ * on the newly created set. Subsequent IOASID allocations within the set need
+ * to use the returned ioasid_set pointer.
+ */
+struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
+{
+	struct ioasid_set *set;
+	unsigned long index;
+	ioasid_t id;
+
+	if (type >= IOASID_SET_TYPE_NR)
+		return ERR_PTR(-EINVAL);
+
+	/* No limit for the set, use whatever is available on the system */
+	if (!quota)
+		quota = ioasid_capacity_avail;
+
+	spin_lock(&ioasid_allocator_lock);
+	if (quota > ioasid_capacity_avail) {
+		pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
+			quota, ioasid_capacity_avail);
+		set = ERR_PTR(-ENOSPC);
+		goto exit_unlock;
+	}
+
+	/*
+	 * Token is only unique within its types but right now we have only
+	 * mm type. If we have more token types, we have to match type as well.
+	 */
+	switch (type) {
+	case IOASID_SET_TYPE_MM:
+		if (!token) {
+			set = ERR_PTR(-EINVAL);
+			goto exit_unlock;
+		}
+		/* Search existing set tokens, reject duplicates */
+		xa_for_each(&ioasid_sets, index, set) {
+			if (set->token == token && set->type == IOASID_SET_TYPE_MM) {
+				set = ERR_PTR(-EEXIST);
+				goto exit_unlock;
+			}
+		}
+		break;
+	case IOASID_SET_TYPE_NULL:
+		if (!token)
+			break;
+		fallthrough;
+	default:
+		pr_err("Invalid token and IOASID type\n");
+		set = ERR_PTR(-EINVAL);
+		goto exit_unlock;
+	}
+
+	set = kzalloc(sizeof(*set), GFP_ATOMIC);
+	if (!set) {
+		set = ERR_PTR(-ENOMEM);
+		goto exit_unlock;
+	}
+
+	if (xa_alloc(&ioasid_sets, &id, set,
+		     XA_LIMIT(0, ioasid_capacity_avail),
+		     GFP_ATOMIC)) {
+		kfree(set);
+		set = ERR_PTR(-ENOSPC);
+		goto exit_unlock;
+	}
+
+	set->token = token;
+	set->type = type;
+	set->quota = quota;
+	set->id = id;
+	atomic_set(&set->nr_ioasids, 0);
+	/*
+	 * Per set XA is used to store private IDs within the set, get ready
+	 * for ioasid_set private ID and system-wide IOASID allocation
+	 * results.
+	 */
+	xa_init(&set->xa);
+	ioasid_capacity_avail -= quota;
+
+exit_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+
+	return set;
+}
+EXPORT_SYMBOL_GPL(ioasid_set_alloc);
+
+static int ioasid_set_free_locked(struct ioasid_set *set)
+{
+	int ret = 0;
+
+	if (!ioasid_set_is_valid(set)) {
+		ret = -EINVAL;
+		goto exit_done;
+	}
+
+	if (atomic_read(&set->nr_ioasids)) {
+		ret = -EBUSY;
+		goto exit_done;
+	}
+
+	WARN_ON(!xa_empty(&set->xa));
+	/*
+	 * Token got released right away after the ioasid_set is freed.
+	 * If a new set is created immediately with the newly released token,
+	 * it will not allocate the same IOASIDs unless they are reclaimed.
+	 */
+	xa_erase(&ioasid_sets, set->id);
+	kfree_rcu(set, rcu);
+exit_done:
+	return ret;
+};
+
+/**
+ * @brief Free an ioasid_set if empty. Restore pending notification list.
+ *
+ * @param set to be freed
+ * @return
+ */
+int ioasid_set_free(struct ioasid_set *set)
+{
+	int ret = 0;
+
+	spin_lock(&ioasid_allocator_lock);
+	ret = ioasid_set_free_locked(set);
+	spin_unlock(&ioasid_allocator_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_set_free);
+
 /**
  * ioasid_alloc - Allocate an IOASID
  * @set: the IOASID set
@@ -411,11 +557,22 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 {
 	struct ioasid_data *data;
 	void *adata;
-	ioasid_t id;
+	ioasid_t id = INVALID_IOASID;
+
+	spin_lock(&ioasid_allocator_lock);
+	/* Check if the IOASID set has been allocated and initialized */
+	if (!ioasid_set_is_valid(set))
+		goto done_unlock;
+
+	if (set->quota <= atomic_read(&set->nr_ioasids)) {
+		pr_err_ratelimited("IOASID set out of quota %d\n",
+				   set->quota);
+		goto done_unlock;
+	}
 
 	data = kzalloc(sizeof(*data), GFP_ATOMIC);
 	if (!data)
-		return INVALID_IOASID;
+		goto done_unlock;
 
 	data->set = set;
 	data->private = private;
@@ -425,7 +582,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 	 * Custom allocator needs allocator data to perform platform specific
 	 * operations.
 	 */
-	spin_lock(&ioasid_allocator_lock);
 	adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
 	id = active_allocator->ops->alloc(min, max, adata);
 	if (id == INVALID_IOASID) {
@@ -442,67 +598,121 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 	}
 	data->id = id;
 
-	spin_unlock(&ioasid_allocator_lock);
-	return id;
+	/* Store IOASID in the per set data */
+	if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
+		pr_err_ratelimited("Failed to store ioasid %d in set\n", id);
+		active_allocator->ops->free(id, active_allocator->ops->pdata);
+		goto exit_free;
+	}
+	atomic_inc(&set->nr_ioasids);
+	goto done_unlock;
 exit_free:
-	spin_unlock(&ioasid_allocator_lock);
 	kfree(data);
-	return INVALID_IOASID;
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+	return id;
 }
 EXPORT_SYMBOL_GPL(ioasid_alloc);
 
+static void ioasid_do_free_locked(struct ioasid_data *data)
+{
+	struct ioasid_data *ioasid_data;
+
+	active_allocator->ops->free(data->id, active_allocator->ops->pdata);
+	/* Custom allocator needs additional steps to free the xa element */
+	if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
+		ioasid_data = xa_erase(&active_allocator->xa, data->id);
+		kfree_rcu(ioasid_data, rcu);
+	}
+	atomic_dec(&data->set->nr_ioasids);
+	xa_erase(&data->set->xa, data->id);
+	/* Destroy the set if empty */
+	if (!atomic_read(&data->set->nr_ioasids))
+		ioasid_set_free_locked(data->set);
+}
+
+int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+	struct ioasid_data *data;
+
+	data = xa_load(&active_allocator->xa, ioasid);
+	if (!data) {
+		pr_err("Trying to get unknown IOASID %u\n", ioasid);
+		return -EINVAL;
+	}
+
+	/* Check set ownership if the set is non-null */
+	if (set && data->set != set) {
+		pr_err("Trying to get IOASID %u outside the set\n", ioasid);
+		/* data found but does not belong to the set */
+		return -EACCES;
+	}
+	refcount_inc(&data->refs);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_locked);
+
 /**
  * ioasid_get - obtain a reference to the IOASID
+ * @set:	the ioasid_set to check permission against if not NULL
+ * @ioasid:	the IOASID to get reference
+ *
+ *
+ * Return: 0 on success, error if failed.
  */
-void ioasid_get(ioasid_t ioasid)
+int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
 {
-	struct ioasid_data *ioasid_data;
+	int ret;
 
 	spin_lock(&ioasid_allocator_lock);
-	ioasid_data = xa_load(&active_allocator->xa, ioasid);
-	if (ioasid_data)
-		refcount_inc(&ioasid_data->refs);
-	else
-		WARN_ON(1);
+	ret = ioasid_get_locked(set, ioasid);
 	spin_unlock(&ioasid_allocator_lock);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(ioasid_get);
 
+bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+	struct ioasid_data *data;
+
+	data = xa_load(&active_allocator->xa, ioasid);
+	if (!data) {
+		pr_err("Trying to put unknown IOASID %u\n", ioasid);
+		return false;
+	}
+	if (set && data->set != set) {
+		pr_err("Trying to drop IOASID %u outside the set\n", ioasid);
+		return false;
+	}
+	if (!refcount_dec_and_test(&data->refs))
+		return false;
+
+	ioasid_do_free_locked(data);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(ioasid_put_locked);
+
 /**
  * ioasid_put - Release a reference to an ioasid
- * @ioasid: the ID to remove
+ * @set:	the ioasid_set to check permission against if not NULL
+ * @ioasid:	the IOASID to drop reference
  *
  * Put a reference to the IOASID, free it when the number of references drops to
  * zero.
  *
  * Return: %true if the IOASID was freed, %false otherwise.
  */
-bool ioasid_put(ioasid_t ioasid)
+bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
 {
-	bool free = false;
-	struct ioasid_data *ioasid_data;
+	bool ret;
 
 	spin_lock(&ioasid_allocator_lock);
-	ioasid_data = xa_load(&active_allocator->xa, ioasid);
-	if (!ioasid_data) {
-		pr_err("Trying to free unknown IOASID %u\n", ioasid);
-		goto exit_unlock;
-	}
-
-	free = refcount_dec_and_test(&ioasid_data->refs);
-	if (!free)
-		goto exit_unlock;
-
-	active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
-	/* Custom allocator needs additional steps to free the xa element */
-	if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
-		ioasid_data = xa_erase(&active_allocator->xa, ioasid);
-		kfree_rcu(ioasid_data, rcu);
-	}
-
-exit_unlock:
+	ret = ioasid_put_locked(set, ioasid);
 	spin_unlock(&ioasid_allocator_lock);
-	return free;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(ioasid_put);
 
diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
index bd41405d34e9..7f97a03a135b 100644
--- a/drivers/iommu/iommu-sva-lib.c
+++ b/drivers/iommu/iommu-sva-lib.c
@@ -8,7 +8,16 @@
 #include "iommu-sva-lib.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
-static DECLARE_IOASID_SET(iommu_sva_pasid);
+static struct ioasid_set *iommu_sva_pasid;
+
+/* Must be called before PASID allocations can occur */
+void iommu_sva_init(void)
+{
+	if (iommu_sva_pasid)
+		return;
+	iommu_sva_pasid = ioasid_set_alloc(NULL, 0, IOASID_SET_TYPE_NULL);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_init);
 
 /**
  * iommu_sva_alloc_pasid - Allocate a PASID for the mm
@@ -35,11 +44,11 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
 	mutex_lock(&iommu_sva_lock);
 	if (mm->pasid) {
 		if (mm->pasid >= min && mm->pasid <= max)
-			ioasid_get(mm->pasid);
+			ioasid_get(iommu_sva_pasid, mm->pasid);
 		else
 			ret = -EOVERFLOW;
 	} else {
-		pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm);
+		pasid = ioasid_alloc(iommu_sva_pasid, min, max, mm);
 		if (pasid == INVALID_IOASID)
 			ret = -ENOMEM;
 		else
@@ -59,7 +68,7 @@ EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
 void iommu_sva_free_pasid(struct mm_struct *mm)
 {
 	mutex_lock(&iommu_sva_lock);
-	if (ioasid_put(mm->pasid))
+	if (ioasid_put(iommu_sva_pasid, mm->pasid))
 		mm->pasid = 0;
 	mutex_unlock(&iommu_sva_lock);
 }
@@ -81,6 +90,6 @@ static bool __mmget_not_zero(void *mm)
  */
 struct mm_struct *iommu_sva_find(ioasid_t pasid)
 {
-	return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);
+	return ioasid_find(iommu_sva_pasid, pasid, __mmget_not_zero);
 }
 EXPORT_SYMBOL_GPL(iommu_sva_find);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 2780bdc84b94..095f4e50dc58 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -4,14 +4,43 @@
 
 #include <linux/types.h>
 #include <linux/errno.h>
+#include <linux/xarray.h>
+#include <linux/refcount.h>
 
 #define INVALID_IOASID ((ioasid_t)-1)
 typedef unsigned int ioasid_t;
 typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
 typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
 
+/* IOASID set types */
+enum ioasid_set_type {
+	IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
+	IOASID_SET_TYPE_MM,	  /* Set token is a mm_struct pointer
+				   * i.e. associated with a process
+				   */
+	IOASID_SET_TYPE_NR,
+};
+
+/**
+ * struct ioasid_set - Meta data about ioasid_set
+ * @nh:		List of notifiers private to that set
+ * @xa:		XArray to store ioasid_set private IDs, can be used for
+ *		guest-host IOASID mapping, or just a private IOASID namespace.
+ * @token:	Unique to identify an IOASID set
+ * @type:	Token types
+ * @quota:	Max number of IOASIDs that can be allocated within the set
+ * @nr_ioasids:	Number of IOASIDs currently allocated in the set
+ * @id:		ID of the set
+ */
 struct ioasid_set {
-	int dummy;
+	struct atomic_notifier_head nh;
+	struct xarray xa;
+	void *token;
+	int type;
+	int quota;
+	atomic_t nr_ioasids;
+	int id;
+	struct rcu_head rcu;
 };
 
 /**
@@ -29,17 +58,20 @@ struct ioasid_allocator_ops {
 	void *pdata;
 };
 
-#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
-
 #if IS_ENABLED(CONFIG_IOASID)
 void ioasid_install_capacity(ioasid_t total);
 int ioasid_reserve_capacity(ioasid_t nr_ioasid);
 int ioasid_cancel_capacity(ioasid_t nr_ioasid);
+struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
+int ioasid_set_free(struct ioasid_set *set);
+struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token);
 
 ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		      void *private);
-void ioasid_get(ioasid_t ioasid);
-bool ioasid_put(ioasid_t ioasid);
+int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
+bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
 void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 		  bool (*getter)(void *));
 int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
@@ -67,11 +99,33 @@ static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
 	return INVALID_IOASID;
 }
 
-static inline void ioasid_get(ioasid_t ioasid)
+static inline struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota,
+						  int type)
 {
+	return ERR_PTR(-ENOTSUPP);
+}
+
+static inline struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
+{
+	return NULL;
+}
+
+static inline int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
+{
+	return -ENOTSUPP;
+}
+
+static inline int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+	return -ENOTSUPP;
+}
+
+static inline bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
+{
+	return false;
 }
 
-static inline bool ioasid_put(ioasid_t ioasid)
+static inline bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
 {
 	return false;
 }
-- 
2.25.1



* [PATCH V4 06/18] iommu/ioasid: Add free function and states
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (4 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 07/18] iommu/ioasid: Add ioasid_set iterator helper functions Jacob Pan
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

When an actively used IOASID is freed due to exceptions, users must be
notified to perform the cleanup. The IOASID shall be put in a pending
state until all users have completed their cleanup work.

This patch adds the ioasid_free() function to let the caller initiate the
freeing process. Both ioasid_free() and ioasid_put() decrement the
reference count. Unlike ioasid_put(), ioasid_free() also transitions the
IOASID to the free pending state, in which further ioasid_get() calls are
prohibited. This paves the way for FREE event notifications that will be
introduced next.
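
As a rough illustration of the intended semantics, the difference between
put and free can be modeled in userspace as below. This is only a sketch:
all model_* names are hypothetical, a plain counter stands in for
refcount_t, and MODEL_FREED is a model-only state (in the kernel the entry
is removed instead).

```c
/* Userspace model only; all model_* names are hypothetical. */
#include <errno.h>
#include <stdbool.h>

enum model_state {
	MODEL_IDLE,
	MODEL_FREE_PENDING,
	MODEL_FREED,		/* model-only: the kernel removes the entry */
};

struct model_ioasid {
	unsigned int refs;	/* stand-in for refcount_t */
	enum model_state state;
};

/* Allocation leaves a single reference, held by the core itself. */
static void model_alloc(struct model_ioasid *d)
{
	d->refs = 1;
	d->state = MODEL_IDLE;
}

/* New references are refused once a free has been initiated. */
static int model_get(struct model_ioasid *d)
{
	if (d->state == MODEL_FREED)
		return -EINVAL;
	if (d->state == MODEL_FREE_PENDING)
		return -EBUSY;
	d->refs++;
	return 0;
}

/* Drop one reference; returns true only if this drop freed the ID. */
static bool model_put(struct model_ioasid *d)
{
	if (d->state == MODEL_FREED || --d->refs)
		return false;
	d->state = MODEL_FREED;
	return true;
}

/* Initiate the free: mark pending, then drop the core's reference. */
static void model_free(struct model_ioasid *d)
{
	if (d->state != MODEL_IDLE)
		return;		/* free already in progress or done */
	d->state = MODEL_FREE_PENDING;
	if (--d->refs == 0)
		d->state = MODEL_FREED;
}
```

In this model, an ID with one outstanding user stays in the free pending
state after model_free(), any further model_get() fails with -EBUSY, and
the user's final model_put() completes the free.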

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 73 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/ioasid.h |  5 +++
 2 files changed, 78 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index d7b476651027..a10f8154c680 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -15,8 +15,26 @@
 static ioasid_t ioasid_capacity = PCI_PASID_MAX;
 static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
 static DEFINE_XARRAY_ALLOC(ioasid_sets);
+
+enum ioasid_state {
+	IOASID_STATE_IDLE,
+	IOASID_STATE_ACTIVE,
+	IOASID_STATE_FREE_PENDING,
+};
+
+/**
+ * struct ioasid_data - Meta data about ioasid
+ *
+ * @id:		Unique ID
+ * @refs:	Number of active users
+ * @state:	Track state of the IOASID
+ * @set:	ioasid_set of the IOASID belongs to
+ * @private:	Private data associated with the IOASID
+ * @rcu:	For free after RCU grace period
+ */
 struct ioasid_data {
 	ioasid_t id;
+	enum ioasid_state state;
 	struct ioasid_set *set;
 	void *private;
 	struct rcu_head rcu;
@@ -597,6 +615,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		goto exit_free;
 	}
 	data->id = id;
+	data->state = IOASID_STATE_IDLE;
 
 	/* Store IOASID in the per set data */
 	if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
@@ -631,6 +650,56 @@ static void ioasid_do_free_locked(struct ioasid_data *data)
 		ioasid_set_free_locked(data->set);
 }
 
+static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+	struct ioasid_data *data;
+
+	data = xa_load(&active_allocator->xa, ioasid);
+	if (!data) {
+		pr_err_ratelimited("Trying to free unknown IOASID %u\n", ioasid);
+		return;
+	}
+	if (data->set != set) {
+		pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
+		return;
+	}
+	/* Check if the set exists */
+	if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
+		return;
+
+	/* Free is already in progress */
+	if (data->state == IOASID_STATE_FREE_PENDING)
+		return;
+
+	data->state = IOASID_STATE_FREE_PENDING;
+	/*
+	 * If the refcount is 1, it means there is no other users of the IOASID
+	 * other than IOASID core itself. There is no need to notify anyone.
+	 */
+	if (!refcount_dec_and_test(&data->refs))
+		return;
+
+	ioasid_do_free_locked(data);
+}
+
+/**
+ * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
+ *               including free from its set and system-wide list.
+ * @set:	The ioasid_set to check permission with. If not NULL, IOASID
+ *		free will fail if the set does not match.
+ * @ioasid:	The IOASID to remove
+ *
+ * TODO: return true if all references dropped, false if async work is in
+ * progress, IOASID is in FREE_PENDING state. wait queue to be used for blocking
+ * free task.
+ */
+void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
+{
+	spin_lock(&ioasid_allocator_lock);
+	ioasid_free_locked(set, ioasid);
+	spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_free);
 int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
 {
 	struct ioasid_data *data;
@@ -640,6 +709,10 @@ int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
 		pr_err("Trying to get unknown IOASID %u\n", ioasid);
 		return -EINVAL;
 	}
+	if (data->state == IOASID_STATE_FREE_PENDING) {
+		pr_err("Trying to get IOASID %u being freed\n", ioasid);
+		return -EBUSY;
+	}
 
 	/* Check set ownership if the set is non-null */
 	if (set && data->set != set) {
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 095f4e50dc58..cabaf0b0348f 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -72,6 +72,7 @@ int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
 int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
 bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
 bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
+void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
 void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 		  bool (*getter)(void *));
 int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
@@ -105,6 +106,10 @@ static inline struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota,
 	return ERR_PTR(-ENOTSUPP);
 }
 
+static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
+{
+}
+
 static inline struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
 {
 	return NULL;
-- 
2.25.1



* [PATCH V4 07/18] iommu/ioasid: Add ioasid_set iterator helper functions
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (5 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 06/18] iommu/ioasid: Add free function and states Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 08/18] iommu/ioasid: Introduce ioasid_set private ID Jacob Pan
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

Users of an ioasid_set may not keep track of all the IOASIDs allocated
under the set. When a collective action is needed on every IOASID, it
is useful to iterate over all the IOASIDs within the set. For example,
when the ioasid_set is freed, the user might perform the same cleanup
operation on each IOASID.

This patch adds an API to iterate over all the IOASIDs within the set.
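
A caller-side sketch of the callback pattern follows. It is a userspace
model under stated assumptions: struct model_set and the model_* and
cleanup_* names are hypothetical stand-ins (a fixed array replaces the
per-set XArray); only the callback shape, fn(ioasid_t id, void *data), is
taken from ioasid_set_for_each_ioasid() in this patch.

```c
/* Userspace model only; model_* and cleanup_* names are hypothetical. */
#include <stddef.h>

typedef unsigned int ioasid_t;

#define MODEL_SET_MAX 8

/* Stand-in for struct ioasid_set: a fixed array instead of an XArray. */
struct model_set {
	ioasid_t ids[MODEL_SET_MAX];
	size_t nr;
};

/* Same callback shape as ioasid_set_for_each_ioasid(). */
static void model_set_for_each_ioasid(struct model_set *set,
				      void (*fn)(ioasid_t id, void *data),
				      void *data)
{
	size_t i;

	for (i = 0; i < set->nr; i++)
		fn(set->ids[i], data);
}

/* Caller-private context handed through the opaque data pointer. */
struct cleanup_stats {
	unsigned int visited;
	ioasid_t highest;
};

/* Example per-IOASID action: count the IDs and remember the largest. */
static void cleanup_one(ioasid_t id, void *data)
{
	struct cleanup_stats *stats = data;

	stats->visited++;
	if (id > stats->highest)
		stats->highest = id;
}
```

The opaque data pointer lets the caller carry its own context, so the
iterator itself needs no knowledge of what the cleanup does.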

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 84 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/ioasid.h | 20 ++++++++++
 2 files changed, 104 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index a10f8154c680..9a3ba157dec3 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -700,6 +700,61 @@ void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
 	spin_unlock(&ioasid_allocator_lock);
 }
 EXPORT_SYMBOL_GPL(ioasid_free);
+
+/**
+ * ioasid_free_all_in_set
+ *
+ * @brief
+ * Free all PASIDs from the system-wide IOASID pool; all subscribers get
+ * notified and do their own cleanup.
+ * Note that some references to the IOASIDs within the set can still
+ * be held after the free call. This is OK in that the IOASIDs will be
+ * marked inactive; the only operation that can be done is ioasid_put().
+ * No need to track IOASID set states since there is no reclaim phase.
+ *
+ * @param
+ * struct ioasid_set where all IOASIDs within the set will be freed.
+ */
+void ioasid_free_all_in_set(struct ioasid_set *set)
+{
+	struct ioasid_data *entry;
+	unsigned long index;
+
+	if (!ioasid_set_is_valid(set))
+		return;
+
+	if (xa_empty(&set->xa))
+		return;
+
+	if (!atomic_read(&set->nr_ioasids))
+		return;
+	spin_lock(&ioasid_allocator_lock);
+	xa_for_each(&set->xa, index, entry) {
+		ioasid_free_locked(set, index);
+		/* Free from per set private pool */
+		xa_erase(&set->xa, index);
+	}
+	spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_free_all_in_set);
+
+/**
+ * ioasid_set_for_each_ioasid
+ * @brief
+ * Iterate over all the IOASIDs within the set
+ */
+void ioasid_set_for_each_ioasid(struct ioasid_set *set,
+				void (*fn)(ioasid_t id, void *data),
+				void *data)
+{
+	struct ioasid_data *entry;
+	unsigned long index;
+
+	xa_for_each(&set->xa, index, entry)
+		fn(index, data);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
+
 int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
 {
 	struct ioasid_data *data;
@@ -789,6 +844,35 @@ bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
 }
 EXPORT_SYMBOL_GPL(ioasid_put);
 
+/**
+ * @brief
+ * Find the ioasid_set of an IOASID. As long as the IOASID is valid,
+ * the set must be valid since the refcounting is based on the number of
+ * IOASIDs in the set.
+ *
+ * @param ioasid
+ * @return struct ioasid_set*
+ */
+struct ioasid_set *ioasid_find_set(ioasid_t ioasid)
+{
+	struct ioasid_allocator_data *idata;
+	struct ioasid_data *ioasid_data;
+	struct ioasid_set *set = NULL;
+
+	rcu_read_lock();
+	idata = rcu_dereference(active_allocator);
+	ioasid_data = xa_load(&idata->xa, ioasid);
+	if (!ioasid_data) {
+		set = ERR_PTR(-ENOENT);
+		goto unlock;
+	}
+	set = ioasid_data->set;
+unlock:
+	rcu_read_unlock();
+	return set;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_set);
+
 /**
  * ioasid_find - Find IOASID data
  * @set: the IOASID set
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index cabaf0b0348f..e7f3e6108724 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -73,12 +73,17 @@ int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
 bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
 bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
 void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
+void ioasid_free_all_in_set(struct ioasid_set *set);
 void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 		  bool (*getter)(void *));
+struct ioasid_set *ioasid_find_set(ioasid_t ioasid);
 int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
 void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
 int ioasid_attach_data(ioasid_t ioasid, void *data);
 void ioasid_detach_data(ioasid_t ioasid);
+void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
+				void (*fn)(ioasid_t id, void *data),
+				void *data);
 #else /* !CONFIG_IOASID */
 static inline void ioasid_install_capacity(ioasid_t total)
 {
@@ -158,5 +163,20 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
 static inline void ioasid_detach_data(ioasid_t ioasid)
 {
 }
+
+static inline void ioasid_free_all_in_set(struct ioasid_set *set)
+{
+}
+
+static inline struct ioasid_set *ioasid_find_set(ioasid_t ioasid)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+
+static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
+					      void (*fn)(ioasid_t id, void *data),
+					      void *data)
+{
+}
 #endif /* CONFIG_IOASID */
 #endif /* __LINUX_IOASID_H */
-- 
2.25.1



* [PATCH V4 08/18] iommu/ioasid: Introduce ioasid_set private ID
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (6 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 07/18] iommu/ioasid: Add ioasid_set iterator helper functions Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 09/18] iommu/ioasid: Introduce notification APIs Jacob Pan
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

When an IOASID set is used for guest SVA, each VM acquires its own
ioasid_set for IOASID allocations. IOASIDs within the VM must be backed by
host/physical IOASIDs, and the mapping between guest and host IOASIDs can
be non-identical. This patch introduces the IOASID set private ID (SPID),
to be used as the guest IOASID. Since the concept of an ioasid_set-specific
namespace is generic, it is named SPID rather than guest IOASID.

As the SPID namespace is within the IOASID set, the IOASID core can provide
lookup services in both directions. An SPID may not be available when its
IOASID is allocated; the mapping between SPID and IOASID is usually
established when a guest page table is bound to a host PASID.
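
The mapping rules can be sketched with a small userspace model. The
model_* names and the array-backed set are hypothetical; the error codes
mirror ioasid_attach_spid() in this patch (-ENOENT for an unknown IOASID,
-EINVAL for an invalid or already used SPID). The IOASID-to-SPID helper is
an assumption added for illustration; the patch itself exports only the
SPID-to-IOASID lookup.

```c
/* Userspace model only; all model_* names are hypothetical. */
#include <errno.h>
#include <stddef.h>

typedef unsigned int ioasid_t;
#define INVALID_IOASID ((ioasid_t)-1)
#define MODEL_SET_MAX 4

struct model_entry {
	ioasid_t id;	/* system-wide (host) IOASID */
	ioasid_t spid;	/* set private ID, e.g. the guest IOASID */
};

struct model_set {
	struct model_entry e[MODEL_SET_MAX];
	size_t nr;
};

/* SPID -> IOASID, the direction ioasid_find_by_spid() covers. */
static ioasid_t model_find_by_spid(const struct model_set *set, ioasid_t spid)
{
	size_t i;

	if (spid == INVALID_IOASID)
		return INVALID_IOASID;
	for (i = 0; i < set->nr; i++)
		if (set->e[i].spid == spid)
			return set->e[i].id;
	return INVALID_IOASID;
}

/* IOASID -> SPID, the reverse direction (hypothetical helper). */
static ioasid_t model_find_spid(const struct model_set *set, ioasid_t id)
{
	size_t i;

	for (i = 0; i < set->nr; i++)
		if (set->e[i].id == id)
			return set->e[i].spid;
	return INVALID_IOASID;
}

/* Attach an SPID; it must be valid and unique within the set. */
static int model_attach_spid(struct model_set *set, ioasid_t id, ioasid_t spid)
{
	size_t i;

	if (spid == INVALID_IOASID)
		return -EINVAL;
	for (i = 0; i < set->nr; i++)
		if (set->e[i].id == id)
			break;
	if (i == set->nr)
		return -ENOENT;
	if (model_find_by_spid(set, spid) != INVALID_IOASID)
		return -EINVAL;
	set->e[i].spid = spid;
	return 0;
}
```

Entries start with spid set to INVALID_IOASID, matching the state right
after allocation, when no guest mapping has been established yet.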

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 104 +++++++++++++++++++++++++++++++++++++++++
 include/linux/ioasid.h |  18 +++++++
 2 files changed, 122 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 9a3ba157dec3..7707bb608bdd 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -26,6 +26,7 @@ enum ioasid_state {
  * struct ioasid_data - Meta data about ioasid
  *
  * @id:		Unique ID
+ * @spid:	Private ID unique within a set
  * @refs:	Number of active users
  * @state:	Track state of the IOASID
  * @set:	ioasid_set of the IOASID belongs to
@@ -34,6 +35,7 @@ enum ioasid_state {
  */
 struct ioasid_data {
 	ioasid_t id;
+	ioasid_t spid;
 	enum ioasid_state state;
 	struct ioasid_set *set;
 	void *private;
@@ -413,6 +415,107 @@ void ioasid_detach_data(ioasid_t ioasid)
 }
 EXPORT_SYMBOL_GPL(ioasid_detach_data);
 
+static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid, bool get)
+{
+	ioasid_t ioasid = INVALID_IOASID;
+	struct ioasid_data *entry;
+	unsigned long index;
+
+	if (!xa_load(&ioasid_sets, set->id)) {
+		pr_warn("Invalid set\n");
+		goto done;
+	}
+
+	xa_for_each(&set->xa, index, entry) {
+		if (spid == entry->spid) {
+			if (get)
+				refcount_inc(&entry->refs);
+			ioasid = index;
+		}
+	}
+done:
+	return ioasid;
+}
+
+/**
+ * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
+ *
+ * @ioasid: the system-wide IOASID to attach
+ * @spid:   the ioasid_set private ID of @ioasid
+ *
+ * After attaching SPID, future lookup can be done via ioasid_find_by_spid().
+ */
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+	struct ioasid_data *data;
+	int ret = 0;
+
+	if (spid == INVALID_IOASID)
+		return -EINVAL;
+
+	spin_lock(&ioasid_allocator_lock);
+	data = xa_load(&active_allocator->xa, ioasid);
+
+	if (!data) {
+		pr_err("No IOASID entry %d to attach SPID %d\n",
+			ioasid, spid);
+		ret = -ENOENT;
+		goto done_unlock;
+	}
+	/* Check if SPID is unique within the set */
+	if (ioasid_find_by_spid_locked(data->set, spid, false) != INVALID_IOASID) {
+		ret = -EINVAL;
+		goto done_unlock;
+	}
+	data->spid = spid;
+
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_attach_spid);
+
+void ioasid_detach_spid(ioasid_t ioasid)
+{
+	struct ioasid_data *data;
+
+	spin_lock(&ioasid_allocator_lock);
+	data = xa_load(&active_allocator->xa, ioasid);
+
+	if (!data || data->spid == INVALID_IOASID) {
+		pr_err("Invalid IOASID entry %d to detach\n", ioasid);
+		goto done_unlock;
+	}
+	data->spid = INVALID_IOASID;
+
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_detach_spid);
+
+/**
+ * ioasid_find_by_spid - Find the system-wide IOASID by a set private ID and
+ * its set.
+ *
+ * @set:	the ioasid_set to search within
+ * @spid:	the set private ID
+ * @get:	flag indicates whether to take a reference once found
+ *
+ * Given a set private ID and its IOASID set, find the system-wide IOASID. Take
+ * a reference upon finding the matching IOASID if @get is true. Return
+ * INVALID_IOASID if the IOASID is not found in the set or the set is not valid.
+ */
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid, bool get)
+{
+	ioasid_t ioasid;
+
+	spin_lock(&ioasid_allocator_lock);
+	ioasid = ioasid_find_by_spid_locked(set, spid, get);
+	spin_unlock(&ioasid_allocator_lock);
+	return ioasid;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
+
 static inline bool ioasid_set_is_valid(struct ioasid_set *set)
 {
 	return xa_load(&ioasid_sets, set->id) == set;
@@ -616,6 +719,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 	}
 	data->id = id;
 	data->state = IOASID_STATE_IDLE;
+	data->spid = INVALID_IOASID;
 
 	/* Store IOASID in the per set data */
 	if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index e7f3e6108724..dcab02886cb5 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -81,6 +81,9 @@ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
 void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
 int ioasid_attach_data(ioasid_t ioasid, void *data);
 void ioasid_detach_data(ioasid_t ioasid);
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
+void ioasid_detach_spid(ioasid_t ioasid);
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid, bool get);
 void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 				void (*fn)(ioasid_t id, void *data),
 				void *data);
@@ -173,6 +176,21 @@ static inline struct ioasid_set *ioasid_find_set(ioasid_t ioasid)
 	return ERR_PTR(-ENOTSUPP);
 }
 
+static inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+	return -ENOTSUPP;
+}
+
+static inline void ioasid_detach_spid(ioasid_t ioasid)
+{
+}
+
+static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set,
+					   ioasid_t spid, bool get)
+{
+	return INVALID_IOASID;
+}
+
 static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 					      void (*fn)(ioasid_t id, void *data),
 					      void *data)
-- 
2.25.1



* [PATCH V4 09/18] iommu/ioasid: Introduce notification APIs
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (7 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 08/18] iommu/ioasid: Introduce ioasid_set private ID Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 10/18] iommu/ioasid: Support mm token type ioasid_set notifications Jacob Pan
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

Relations among IOASID users largely follow a publisher-subscriber
pattern. E.g. to support guest SVA on Intel Scalable I/O Virtualization
(SIOV) enabled platforms, VFIO, IOMMU, device drivers, KVM are all users
of IOASIDs. When a state change occurs, VFIO publishes the change event
that needs to be processed by other users/subscribers.

This patch introduces two types of notifications: global and per
ioasid_set. The latter is intended for users who only need to handle
events related to the IOASIDs of a given set.
For more information, refer to the kernel documentation at
Documentation/driver-api/ioasid.rst.
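
The dispatch rules can be modeled in userspace roughly as follows. This is
a sketch, not the kernel notifier implementation: the model_* names, the
fixed-size chain, and the trace recorder are hypothetical. It assumes the
usual kernel notifier chain behavior of calling higher-priority
subscribers first, and mirrors ioasid_notify() in calling the global chain
before the per-set chain when both flags are given.

```c
/* Userspace model only; all model_* names are hypothetical. */
#include <errno.h>
#include <stddef.h>

#define MODEL_NOTIFY_FLAG_ALL	(1u << 0)	/* global chain */
#define MODEL_NOTIFY_FLAG_SET	(1u << 1)	/* per-set chain */
#define MODEL_CHAIN_MAX		4

struct model_nb {
	void (*fn)(unsigned long cmd, unsigned int id, void *ctx);
	void *ctx;
	int priority;
};

struct model_chain {
	struct model_nb *nb[MODEL_CHAIN_MAX];
	size_t nr;
};

/* Insert a subscriber, keeping the chain in descending priority order. */
static int model_register(struct model_chain *c, struct model_nb *nb)
{
	size_t i;

	if (c->nr == MODEL_CHAIN_MAX)
		return -ENOSPC;
	for (i = c->nr; i > 0 && c->nb[i - 1]->priority < nb->priority; i--)
		c->nb[i] = c->nb[i - 1];
	c->nb[i] = nb;
	c->nr++;
	return 0;
}

static void model_call_chain(struct model_chain *c, unsigned long cmd,
			     unsigned int id)
{
	size_t i;

	for (i = 0; i < c->nr; i++)
		c->nb[i]->fn(cmd, id, c->nb[i]->ctx);
}

/* Publish an event to the global chain, the per-set chain, or both. */
static void model_notify(struct model_chain *global, struct model_chain *set,
			 unsigned long cmd, unsigned int id,
			 unsigned int flags)
{
	if (flags & MODEL_NOTIFY_FLAG_ALL)
		model_call_chain(global, cmd, id);
	if (flags & MODEL_NOTIFY_FLAG_SET)
		model_call_chain(set, cmd, id);
}

/* Example subscriber: append a one-letter tag to a shared trace. */
struct model_trace {
	char order[2 * MODEL_CHAIN_MAX + 1];
	size_t len;
};

struct model_sub {
	char tag;
	struct model_trace *trace;
};

static void model_record(unsigned long cmd, unsigned int id, void *ctx)
{
	struct model_sub *sub = ctx;

	(void)cmd;
	(void)id;
	if (sub->trace->len < sizeof(sub->trace->order) - 1)
		sub->trace->order[sub->trace->len++] = sub->tag;
}
```

With priorities assigned in the spirit of enum ioasid_notifier_prios, a
CPU-level subscriber on a chain runs before a device-level subscriber on
the same chain, regardless of registration order.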

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 111 +++++++++++++++++++++++++++++++++++++++--
 include/linux/ioasid.h |  54 ++++++++++++++++++++
 2 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 7707bb608bdd..56577e745c4b 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -10,12 +10,33 @@
 #include <linux/spinlock.h>
 #include <linux/xarray.h>
 
+/*
+ * An IOASID can have multiple consumers where each consumer may have
+ * hardware contexts associated with the IOASID.
+ * When a status change occurs, like on IOASID deallocation, notifier chains
+ * are used to keep the consumers in sync.
+ * This is a publisher-subscriber pattern where publisher can change the
+ * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
+ * On the other hand, subscribers get notified for the state change and
+ * keep local states in sync.
+ */
+static ATOMIC_NOTIFIER_HEAD(ioasid_notifier);
+static DEFINE_SPINLOCK(ioasid_nb_lock);
+
 /* Default to PCIe standard 20 bit PASID */
 #define PCI_PASID_MAX 0x100000
 static ioasid_t ioasid_capacity = PCI_PASID_MAX;
 static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
 static DEFINE_XARRAY_ALLOC(ioasid_sets);
 
+struct ioasid_set_nb {
+	struct list_head	list;
+	struct notifier_block	*nb;
+	void			*token;
+	struct ioasid_set	*set;
+	bool			active;
+};
+
 enum ioasid_state {
 	IOASID_STATE_IDLE,
 	IOASID_STATE_ACTIVE,
@@ -415,6 +436,38 @@ void ioasid_detach_data(ioasid_t ioasid)
 }
 EXPORT_SYMBOL_GPL(ioasid_detach_data);
 
+/**
+ * ioasid_notify - Send notification on a given IOASID for status change.
+ *
+ * @data:	The IOASID data to which the notification will be sent
+ * @cmd:	Notification event sent by IOASID external users, can be
+ *		IOASID_BIND or IOASID_UNBIND.
+ *
+ * @flags:	Special instructions, e.g. notify within a set or global by
+ *		IOASID_NOTIFY_FLAG_SET or IOASID_NOTIFY_FLAG_ALL flags
+ * Caller must hold ioasid_allocator_lock and reference to the IOASID
+ */
+static int ioasid_notify(struct ioasid_data *data,
+			 enum ioasid_notify_val cmd, unsigned int flags)
+{
+	struct ioasid_nb_args args = { 0 };
+	int ret = 0;
+
+	if (flags & ~(IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_SET))
+		return -EINVAL;
+
+	args.id = data->id;
+	args.set = data->set;
+	args.pdata = data->private;
+	args.spid = data->spid;
+	if (flags & IOASID_NOTIFY_FLAG_ALL)
+		ret = atomic_notifier_call_chain(&ioasid_notifier, cmd, &args);
+	if (flags & IOASID_NOTIFY_FLAG_SET)
+		ret = atomic_notifier_call_chain(&data->set->nh, cmd, &args);
+
+	return ret;
+}
+
 static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid, bool get)
 {
 	ioasid_t ioasid = INVALID_IOASID;
@@ -468,7 +521,7 @@ int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
 		goto done_unlock;
 	}
 	data->spid = spid;
-
+	ioasid_notify(data, IOASID_NOTIFY_BIND, IOASID_NOTIFY_FLAG_SET);
 done_unlock:
 	spin_unlock(&ioasid_allocator_lock);
 	return ret;
@@ -486,8 +539,8 @@ void ioasid_detach_spid(ioasid_t ioasid)
 		pr_err("Invalid IOASID entry %d to detach\n", ioasid);
 		goto done_unlock;
 	}
+	ioasid_notify(data, IOASID_NOTIFY_UNBIND, IOASID_NOTIFY_FLAG_SET);
 	data->spid = INVALID_IOASID;
-
 done_unlock:
 	spin_unlock(&ioasid_allocator_lock);
 }
@@ -603,6 +656,8 @@ struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
 	set->quota = quota;
 	set->id = id;
 	atomic_set(&set->nr_ioasids, 0);
+	ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
+
 	/*
 	 * Per set XA is used to store private IDs within the set, get ready
 	 * for ioasid_set private ID and system-wide IOASID allocation
@@ -655,7 +710,9 @@ int ioasid_set_free(struct ioasid_set *set)
 	int ret = 0;
 
 	spin_lock(&ioasid_allocator_lock);
+	spin_lock(&ioasid_nb_lock);
 	ret = ioasid_set_free_locked(set);
+	spin_unlock(&ioasid_nb_lock);
 	spin_unlock(&ioasid_allocator_lock);
 	return ret;
 }
@@ -728,6 +785,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		goto exit_free;
 	}
 	atomic_inc(&set->nr_ioasids);
+	ioasid_notify(data, IOASID_NOTIFY_ALLOC, IOASID_NOTIFY_FLAG_SET);
 	goto done_unlock;
 exit_free:
 	kfree(data);
@@ -780,9 +838,11 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
 	 * If the refcount is 1, it means there is no other users of the IOASID
 	 * other than IOASID core itself. There is no need to notify anyone.
 	 */
-	if (!refcount_dec_and_test(&data->refs))
+	if (!refcount_dec_and_test(&data->refs)) {
+		ioasid_notify(data, IOASID_NOTIFY_FREE,
+			IOASID_NOTIFY_FLAG_SET | IOASID_NOTIFY_FLAG_ALL);
 		return;
-
+	}
 	ioasid_do_free_locked(data);
 }
 
@@ -833,15 +893,39 @@ void ioasid_free_all_in_set(struct ioasid_set *set)
 	if (!atomic_read(&set->nr_ioasids))
 		return;
 	spin_lock(&ioasid_allocator_lock);
+	spin_lock(&ioasid_nb_lock);
 	xa_for_each(&set->xa, index, entry) {
 		ioasid_free_locked(set, index);
 		/* Free from per set private pool */
 		xa_erase(&set->xa, index);
 	}
+	spin_unlock(&ioasid_nb_lock);
 	spin_unlock(&ioasid_allocator_lock);
 }
 EXPORT_SYMBOL_GPL(ioasid_free_all_in_set);
 
+/*
+ * ioasid_find_mm_set - Retrieve IOASID set with mm token
+ * Take a reference of the set if found.
+ */
+struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
+{
+	struct ioasid_set *set;
+	unsigned long index;
+
+	spin_lock(&ioasid_allocator_lock);
+
+	xa_for_each(&ioasid_sets, index, set) {
+		if (set->type == IOASID_SET_TYPE_MM && set->token == token)
+			goto exit_unlock;
+	}
+	set = NULL;
+exit_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+	return set;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_mm_set);
+
 /**
  * ioasid_set_for_each_ioasid
  * @brief
@@ -1021,6 +1105,25 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 }
 EXPORT_SYMBOL_GPL(ioasid_find);
 
+int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)
+{
+	if (set)
+		return atomic_notifier_chain_register(&set->nh, nb);
+	else
+		return atomic_notifier_chain_register(&ioasid_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier);
+
+void ioasid_unregister_notifier(struct ioasid_set *set,
+				struct notifier_block *nb)
+{
+	if (set)
+		atomic_notifier_chain_unregister(&set->nh, nb);
+	else
+		atomic_notifier_chain_unregister(&ioasid_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
+
 MODULE_AUTHOR("Jean-Philippe Brucker <jean-philippe.brucker@arm.com>");
 MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
 MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index dcab02886cb5..d8b85a04214f 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -58,6 +58,47 @@ struct ioasid_allocator_ops {
 	void *pdata;
 };
 
+/* Notification data when IOASID status changed */
+enum ioasid_notify_val {
+	IOASID_NOTIFY_ALLOC = 1,
+	IOASID_NOTIFY_FREE,
+	IOASID_NOTIFY_BIND,
+	IOASID_NOTIFY_UNBIND,
+};
+
+#define IOASID_NOTIFY_FLAG_ALL BIT(0)
+#define IOASID_NOTIFY_FLAG_SET BIT(1)
+/**
+ * enum ioasid_notifier_prios - IOASID event notification order
+ *
+ * When status of an IOASID changes, users might need to take actions to
+ * reflect the new state. For example, when an IOASID is freed due to
+ * exception, the hardware context in virtual CPU, DMA device, and IOMMU
+ * shall be cleared and drained. Order is required to prevent life cycle
+ * problems.
+ */
+enum ioasid_notifier_prios {
+	IOASID_PRIO_LAST,
+	IOASID_PRIO_DEVICE,
+	IOASID_PRIO_IOMMU,
+	IOASID_PRIO_CPU,
+};
+
+/**
+ * struct ioasid_nb_args - Argument provided by IOASID core when notifier
+ * is called.
+ * @id:		The IOASID being notified
+ * @spid:	The set private ID associated with the IOASID
+ * @set:	The IOASID set of @id
+ * @pdata:	The private data attached to the IOASID
+ */
+struct ioasid_nb_args {
+	ioasid_t id;
+	ioasid_t spid;
+	struct ioasid_set *set;
+	void *pdata;
+};
+
 #if IS_ENABLED(CONFIG_IOASID)
 void ioasid_install_capacity(ioasid_t total);
 int ioasid_reserve_capacity(ioasid_t nr_ioasid);
@@ -84,6 +125,10 @@ void ioasid_detach_data(ioasid_t ioasid);
 int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
 void ioasid_detach_spid(ioasid_t ioasid);
 ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid, bool get);
+int ioasid_register_notifier(struct ioasid_set *set,
+			struct notifier_block *nb);
+void ioasid_unregister_notifier(struct ioasid_set *set,
+				struct notifier_block *nb);
 void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 				void (*fn)(ioasid_t id, void *data),
 				void *data);
@@ -149,6 +194,15 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 	return NULL;
 }
 
+static inline int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)
+{
+	return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier(struct ioasid_set *set, struct notifier_block *nb)
+{
+}
+
 static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
 {
 	return -ENOTSUPP;
-- 
2.25.1



* [PATCH V4 10/18] iommu/ioasid: Support mm token type ioasid_set notifications
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (8 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 09/18] iommu/ioasid: Introduce notification APIs Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 11/18] iommu/ioasid: Add ownership check in guest bind Jacob Pan
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

As a system-wide resource, IOASID is often shared by multiple kernel
subsystems that are independent of each other. However, at the
ioasid_set level, these kernel subsystems must communicate with each
other for ownership checking, event notifications, etc. For example, on
Intel Scalable IO Virtualization (SIOV) enabled platforms, KVM and VFIO
instances under the same process/guest must be aware of a shared IOASID
set.
The IOASID_SET_TYPE_MM token type was introduced to explicitly mark an
IOASID set that belongs to a process and therefore uses that process's
mm_struct pointer as its token. Users within the same process can then
identify each other based on this token.

This patch introduces MM token specific event registration APIs. Event
subscribers such as KVM instances can register an IOASID event handler
without knowledge of the ioasid_set. Event handlers are registered
using the subscriber's mm_struct pointer as a token. If a subscriber
registers a handler *prior* to the creation of the ioasid_set, the
handler’s notifier block is stored in a pending list within the IOASID
core. Once the ioasid_set for the MM token is created, the IOASID core
registers the notifier block on that set.
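
For illustration only, the pending-list behaviour described above can be
modelled in plain userspace C. This is a sketch under assumptions, not the
kernel implementation: the pend_* names, the fixed-size table, and the
listener counter are hypothetical stand-ins for the pending list and the
per-set notifier chain.

```c
/* Userspace model only; every pend_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PEND_MAX 8

struct pend_set {
	const void *token;	/* stands in for the mm_struct pointer */
	int nr_listeners;	/* stands in for the set's notifier chain */
};

struct pend_entry {
	const void *token;
	struct pend_set *set;
	bool used;
	bool active;		/* true once attached to a set */
};

static struct pend_entry pend_table[PEND_MAX];

/* Register by token; attach immediately if the set already exists. */
static int pend_register_mm(const void *token, struct pend_set *existing)
{
	assert(token);
	for (size_t i = 0; i < PEND_MAX; i++) {
		if (pend_table[i].used)
			continue;
		pend_table[i].used = true;
		pend_table[i].token = token;
		pend_table[i].set = existing;
		pend_table[i].active = existing != NULL;
		if (existing)
			existing->nr_listeners++;
		return 0;
	}
	return -1;	/* table full */
}

/* Called when a set is created: activate matching inactive entries. */
static int pend_set_created(struct pend_set *set)
{
	int activated = 0;

	assert(set);
	for (size_t i = 0; i < PEND_MAX; i++) {
		struct pend_entry *e = &pend_table[i];

		if (!e->used || e->active || e->token != set->token)
			continue;
		e->set = set;
		e->active = true;
		set->nr_listeners++;
		activated++;
	}
	return activated;
}
```

In this model, a subscriber that registers before the set exists stays
inactive until pend_set_created() runs for a set carrying the same token.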

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 142 +++++++++++++++++++++++++++++++++++++++++
 include/linux/ioasid.h |  18 ++++++
 2 files changed, 160 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 56577e745c4b..96e941dfada7 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -21,6 +21,8 @@
  * keep local states in sync.
  */
 static ATOMIC_NOTIFIER_HEAD(ioasid_notifier);
+/* List to hold pending notification block registrations */
+static LIST_HEAD(ioasid_nb_pending_list);
 static DEFINE_SPINLOCK(ioasid_nb_lock);
 
 /* Default to PCIe standard 20 bit PASID */
@@ -574,6 +576,27 @@ static inline bool ioasid_set_is_valid(struct ioasid_set *set)
 	return xa_load(&ioasid_sets, set->id) == set;
 }
 
+static void ioasid_add_pending_nb(struct ioasid_set *set)
+{
+	struct ioasid_set_nb *curr;
+
+	if (set->type != IOASID_SET_TYPE_MM)
+		return;
+	/*
+	 * Check if there are any pending nb requests for the given token, if so
+	 * add them to the notifier chain.
+	 */
+	spin_lock(&ioasid_nb_lock);
+	list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+		if (curr->token == set->token && !curr->active) {
+			atomic_notifier_chain_register(&set->nh, curr->nb);
+			curr->set = set;
+			curr->active = true;
+		}
+	}
+	spin_unlock(&ioasid_nb_lock);
+}
+
 /**
  * ioasid_set_alloc - Allocate a new IOASID set for a given token
  *
@@ -658,6 +681,11 @@ struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
 	atomic_set(&set->nr_ioasids, 0);
 	ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
 
+	/*
+	 * Check if there are any pending nb requests for the given token, if so
+	 * add them to the notifier chain.
+	 */
+	ioasid_add_pending_nb(set);
 	/*
 	 * Per set XA is used to store private IDs within the set, get ready
 	 * for ioasid_set private ID and system-wide IOASID allocation
@@ -675,6 +703,7 @@ EXPORT_SYMBOL_GPL(ioasid_set_alloc);
 
 static int ioasid_set_free_locked(struct ioasid_set *set)
 {
+	struct ioasid_set_nb *curr;
 	int ret = 0;
 
 	if (!ioasid_set_is_valid(set)) {
@@ -688,6 +717,16 @@ static int ioasid_set_free_locked(struct ioasid_set *set)
 	}
 
 	WARN_ON(!xa_empty(&set->xa));
+	/* Restore pending status of the set NBs */
+	list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+		if (curr->token == set->token) {
+			if (curr->active)
+				curr->active = false;
+			else
+				pr_warn("Set token exists but not active!\n");
+		}
+	}
+
 	/*
 	 * Token got released right away after the ioasid_set is freed.
 	 * If a new set is created immediately with the newly released token,
@@ -1117,6 +1156,22 @@ EXPORT_SYMBOL_GPL(ioasid_register_notifier);
 void ioasid_unregister_notifier(struct ioasid_set *set,
 				struct notifier_block *nb)
 {
+	struct ioasid_set_nb *curr;
+
+	spin_lock(&ioasid_nb_lock);
+	/*
+	 * Pending list is registered with a token without an ioasid_set,
+	 * therefore should not be unregistered directly.
+	 */
+	list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+		if (curr->nb == nb) {
+			pr_warn("Cannot unregister NB from pending list\n");
+			spin_unlock(&ioasid_nb_lock);
+			return;
+		}
+	}
+	spin_unlock(&ioasid_nb_lock);
+
 	if (set)
 		atomic_notifier_chain_unregister(&set->nh, nb);
 	else
@@ -1124,6 +1179,93 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
 }
 EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
 
+/**
+ * ioasid_register_notifier_mm - Register a notifier block on the IOASID set
+ *                               created by the mm_struct pointer as the token
+ *
+ * @mm: the mm_struct token of the ioasid_set
+ * @nb: notifier block to be registered on the ioasid_set
+ *
+ * This is a variant of ioasid_register_notifier() where the caller intends to
+ * listen to IOASID events belonging to the ioasid_set created under the same
+ * process. The caller is not aware of the ioasid_set and does not need to hold
+ * a reference to it.
+ */
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+	struct ioasid_set_nb *curr;
+	struct ioasid_set *set;
+	int ret = 0;
+
+	spin_lock(&ioasid_nb_lock);
+	/* Check for duplicates, nb is unique per set */
+	list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+		if (curr->token == mm && curr->nb == nb) {
+			ret = -EBUSY;
+			goto exit_unlock;
+		}
+	}
+	curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
+	if (!curr) {
+		ret = -ENOMEM;
+		goto exit_unlock;
+	}
+	/* Check if the token has an existing set */
+	set = ioasid_find_mm_set(mm);
+	if (!set) {
+		/* Add to the rsvd list as inactive */
+		curr->active = false;
+	} else {
+		/* REVISIT: Only register empty set for now. Can add an option
+		 * in the future to playback existing PASIDs.
+		 */
+		if (atomic_read(&set->nr_ioasids)) {
+			pr_warn("IOASID set %d not empty %d\n", set->id,
+				atomic_read(&set->nr_ioasids));
+			ret = -EBUSY;
+			goto exit_free;
+		}
+		curr->token = mm;
+		curr->nb = nb;
+		curr->active = true;
+		curr->set = set;
+
+		/* Set already created, add to the notifier chain */
+		atomic_notifier_chain_register(&set->nh, nb);
+	}
+
+	list_add(&curr->list, &ioasid_nb_pending_list);
+	goto exit_unlock;
+exit_free:
+	kfree(curr);
+exit_unlock:
+	spin_unlock(&ioasid_nb_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
+
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+	struct ioasid_set_nb *curr;
+
+	spin_lock(&ioasid_nb_lock);
+	list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+		if (curr->token == mm && curr->nb == nb) {
+			list_del(&curr->list);
+			spin_unlock(&ioasid_nb_lock);
+			if (curr->active) {
+				atomic_notifier_chain_unregister(&curr->set->nh,
+								 nb);
+			}
+			kfree(curr);
+			return;
+		}
+	}
+	pr_warn("No ioasid set found for mm token %llx\n",  (u64)mm);
+	spin_unlock(&ioasid_nb_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
+
 MODULE_AUTHOR("Jean-Philippe Brucker <jean-philippe.brucker@arm.com>");
 MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
 MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index d8b85a04214f..c97e80ff65cc 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -132,6 +132,8 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
 void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 				void (*fn)(ioasid_t id, void *data),
 				void *data);
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
 #else /* !CONFIG_IOASID */
 static inline void ioasid_install_capacity(ioasid_t total)
 {
@@ -250,5 +252,21 @@ static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 					      void *data)
 {
 }
+
+static inline int ioasid_register_notifier_mm(struct mm_struct *mm,
+					      struct notifier_block *nb)
+{
+	return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier_mm(struct mm_struct *mm,
+						 struct notifier_block *nb)
+{
+}
+
+static inline bool ioasid_queue_work(struct work_struct *work)
+{
+	return false;
+}
 #endif /* CONFIG_IOASID */
 #endif /* __LINUX_IOASID_H */
-- 
2.25.1



* [PATCH V4 11/18] iommu/ioasid: Add ownership check in guest bind
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (9 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 10/18] iommu/ioasid: Support mm token type ioasid_set notifications Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 12/18] iommu/vt-d: Remove mm reference for guest SVA Jacob Pan
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

A bind guest page table call comes with an IOASID provided by
userspace. To prevent attacks by malicious users, we must ensure the
IOASID was allocated under the same process.

This patch adds a new API that will perform an ownership check that is
based on whether the IOASID belongs to the ioasid_set allocated with the
mm_struct pointer as a token.
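
As a rough illustration, the order of checks can be modelled in userspace C
as below. This is a sketch, not the kernel code: the own_* types are
hypothetical, and current_mm is passed explicitly because there is no
"current" task outside the kernel. The error codes mirror the ones used by
ioasid_get_if_owned() in this patch.

```c
/* Userspace model only; every own_* name is hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum own_set_type { OWN_SET_TYPE_OTHER, OWN_SET_TYPE_MM };

struct own_set {
	enum own_set_type type;
	const void *token;	/* stands in for the mm_struct pointer */
};

struct own_ioasid {
	struct own_set *set;	/* NULL models "no set found" */
	int refs;
};

/* Take a reference only if the IOASID's set is owned by current_mm. */
static int own_get_if_owned(struct own_ioasid *id, const void *current_mm)
{
	assert(id);
	if (!id->set)
		return -ENOENT;		/* IOASID has no set */
	if (id->set->type != OWN_SET_TYPE_MM)
		return -EINVAL;		/* set is not tied to a process */
	if (id->set->token != current_mm)
		return -EPERM;		/* allocated by another process */
	id->refs++;
	return 0;
}
```

A caller in this model pairs a successful own_get_if_owned() with a later
reference drop, the same way the bind and unbind paths in this patch pair
ioasid_get_if_owned() with ioasid_put().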

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 37 +++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommu.c  | 16 ++++++++++++++--
 include/linux/ioasid.h |  6 ++++++
 3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 96e941dfada7..28a2e9b6594d 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -9,6 +9,7 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/xarray.h>
+#include <linux/sched/mm.h>
 
 /*
  * An IOASID can have multiple consumers where each consumer may have
@@ -1028,6 +1029,42 @@ int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
 }
 EXPORT_SYMBOL_GPL(ioasid_get);
 
+/**
+ * ioasid_get_if_owned - obtain a reference to the IOASID if the IOASID belongs
+ * 		to the ioasid_set with the current mm as token
+ * @ioasid:	the IOASID to get reference
+ *
+ *
+ * Return: 0 on success, error if failed.
+ */
+int ioasid_get_if_owned(ioasid_t ioasid)
+{
+	struct ioasid_set *set;
+	int ret;
+
+	spin_lock(&ioasid_allocator_lock);
+	set = ioasid_find_set(ioasid);
+	if (IS_ERR_OR_NULL(set)) {
+		ret = -ENOENT;
+		goto done_unlock;
+	}
+	if (set->type != IOASID_SET_TYPE_MM) {
+		ret = -EINVAL;
+		goto done_unlock;
+	}
+	if (current->mm != set->token) {
+		ret = -EPERM;
+		goto done_unlock;
+	}
+
+	ret = ioasid_get_locked(set, ioasid);
+done_unlock:
+	spin_unlock(&ioasid_allocator_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_if_owned);
+
 bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
 {
 	struct ioasid_data *data;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fd76e2f579fe..18716d856b02 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2169,7 +2169,13 @@ int iommu_uapi_sva_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	if (ret)
 		return ret;
 
-	return domain->ops->sva_bind_gpasid(domain, dev, &data);
+	ret = ioasid_get_if_owned(data.hpasid);
+	if (ret)
+		return ret;
+	ret = domain->ops->sva_bind_gpasid(domain, dev, &data);
+	ioasid_put(NULL, data.hpasid);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_uapi_sva_bind_gpasid);
 
@@ -2196,7 +2202,13 @@ int iommu_uapi_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev
 	if (ret)
 		return ret;
 
-	return iommu_sva_unbind_gpasid(domain, dev, data.hpasid);
+	ret = ioasid_get_if_owned(data.hpasid);
+	if (ret)
+		return ret;
+	ret = iommu_sva_unbind_gpasid(domain, dev, data.hpasid);
+	ioasid_put(NULL, data.hpasid);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_uapi_sva_unbind_gpasid);
 
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index c97e80ff65cc..9624b665f810 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -111,6 +111,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		      void *private);
 int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
 int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_get_if_owned(ioasid_t ioasid);
 bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
 bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
 void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
@@ -180,6 +181,11 @@ static inline int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
 	return -ENOTSUPP;
 }
 
+static inline int ioasid_get_if_owned(ioasid_t ioasid)
+{
+	return -ENOTSUPP;
+}
+
 static inline bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
 {
 	return false;
-- 
2.25.1



* [PATCH V4 12/18] iommu/vt-d: Remove mm reference for guest SVA
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (10 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 11/18] iommu/ioasid: Add ownership check in guest bind Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 13/18] iommu/ioasid: Add a workqueue for cleanup work Jacob Pan
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

Now that the IOASID core keeps track of the IOASID to mm_struct ownership
in the form of an ioasid_set with the IOASID_SET_TYPE_MM token type, there
is no need to keep the same mapping in the VT-d driver specific data.
Native SVM usage is not affected by the change.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/svm.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index c469c24d23f5..f75699ddb923 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -363,12 +363,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 			ret = -ENOMEM;
 			goto out;
 		}
-		/* REVISIT: upper layer/VFIO can track host process that bind
-		 * the PASID. ioasid_set = mm might be sufficient for vfio to
-		 * check pasid VMM ownership. We can drop the following line
-		 * once VFIO and IOASID set check is in place.
-		 */
-		svm->mm = get_task_mm(current);
 		svm->pasid = data->hpasid;
 		if (data->flags & IOMMU_SVA_GPASID_VAL) {
 			svm->gpasid = data->gpasid;
@@ -376,7 +370,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 		}
 		ioasid_attach_data(data->hpasid, svm);
 		INIT_LIST_HEAD_RCU(&svm->devs);
-		mmput(svm->mm);
 	}
 	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
 	if (!sdev) {
-- 
2.25.1



* [PATCH V4 13/18] iommu/ioasid: Add a workqueue for cleanup work
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (11 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 12/18] iommu/vt-d: Remove mm reference for guest SVA Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [PATCH V4 14/18] iommu/vt-d: Listen to IOASID notifications Jacob Pan
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

An IOASID can have multiple users, such as the IOMMU driver, KVM, and
device drivers. The atomic IOASID notifier is used to inform users of
IOASID state changes. For example, the IOASID_NOTIFY_UNBIND event is
issued when the IOASID is no longer bound to an address space. This
requires ordered actions among users to tear down their contexts.

Not all work can be handled in the atomic notifier handler. This patch
introduces a shared, ordered workqueue for all IOASID users who wish to
perform work asynchronously upon notification.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/ioasid.c | 25 +++++++++++++++++++++++++
 include/linux/ioasid.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 28a2e9b6594d..d42b39ca2c8b 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -32,6 +32,9 @@ static ioasid_t ioasid_capacity = PCI_PASID_MAX;
 static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
 static DEFINE_XARRAY_ALLOC(ioasid_sets);
 
+/* Workqueue for IOASID users to do cleanup upon notification */
+static struct workqueue_struct *ioasid_wq;
+
 struct ioasid_set_nb {
 	struct list_head	list;
 	struct notifier_block	*nb;
@@ -1281,6 +1284,12 @@ int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
 
+bool ioasid_queue_work(struct work_struct *work)
+{
+	return queue_work(ioasid_wq, work);
+}
+EXPORT_SYMBOL_GPL(ioasid_queue_work);
+
 void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
 {
 	struct ioasid_set_nb *curr;
@@ -1303,7 +1312,23 @@ void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *
 }
 EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
 
+static int __init ioasid_init(void)
+{
+	ioasid_wq = alloc_ordered_workqueue("ioasid_wq", 0);
+	if (!ioasid_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __exit ioasid_cleanup(void)
+{
+	destroy_workqueue(ioasid_wq);
+}
+
 MODULE_AUTHOR("Jean-Philippe Brucker <jean-philippe.brucker@arm.com>");
 MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
 MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
 MODULE_LICENSE("GPL");
+module_init(ioasid_init);
+module_exit(ioasid_cleanup);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 9624b665f810..4547086797df 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -135,6 +135,7 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 				void *data);
 int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
 void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+bool ioasid_queue_work(struct work_struct *work);
 #else /* !CONFIG_IOASID */
 static inline void ioasid_install_capacity(ioasid_t total)
 {
-- 
2.25.1



* [PATCH V4 14/18] iommu/vt-d: Listen to IOASID notifications
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (12 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 13/18] iommu/ioasid: Add a workqueue for cleanup work Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [RFC PATCH 15/18] cgroup: Introduce ioasids controller Jacob Pan
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

On Intel Scalable I/O Virtualization (SIOV) enabled platforms, the IOMMU
driver is one of the users of IOASIDs. In the normal flow, callers
perform IOASID allocation, bind, unbind, and free in order. However, for
guest SVA, IOASID free could come before unbind because the guest is
untrusted. This patch registers an IOASID notification handler so that
the IOMMU driver can perform PASID teardown upon receiving an unexpected
IOASID free event.
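
The free-before-unbind flow can be sketched as a small userspace state
model. It rests on assumptions: the fb_* names are hypothetical, the
listener runs synchronously (the driver in this patch defers the work to
the IOASID workqueue), and the states loosely follow the FREE_PENDING
behaviour mentioned in the code comments.

```c
/* Userspace model only; every fb_* name is hypothetical. */
#include <assert.h>
#include <stddef.h>

enum fb_state { FB_ACTIVE, FB_FREE_PENDING, FB_RECLAIMED };

struct fb_ioasid {
	int refs;
	enum fb_state state;
	void *pdata;		/* stands in for the driver's svm data */
};

/* New references are refused once a free has been requested. */
static int fb_get(struct fb_ioasid *id)
{
	if (id->state != FB_ACTIVE)
		return -1;
	id->refs++;
	return 0;
}

/* Drop one reference; the last drop reclaims the ID. */
static void fb_put(struct fb_ioasid *id)
{
	assert(id->refs > 0);
	if (--id->refs == 0)
		id->state = FB_RECLAIMED;
}

/* Listener side: detach private data, then drop the bind reference. */
static void fb_listener(struct fb_ioasid *id)
{
	id->pdata = NULL;
	fb_put(id);
}

/* Free drops the allocation reference and notifies if users remain. */
static void fb_free(struct fb_ioasid *id, void (*on_free)(struct fb_ioasid *))
{
	assert(id && id->state == FB_ACTIVE);
	if (--id->refs == 0) {
		id->state = FB_RECLAIMED;
		return;
	}
	id->state = FB_FREE_PENDING;
	if (on_free)
		on_free(id);
}
```

In the model, a bind takes one extra reference with fb_get(); a later
fb_free() then leaves the ID pending until the listener drops that
reference.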

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/iommu.c |   2 +
 drivers/iommu/intel/svm.c   | 109 +++++++++++++++++++++++++++++++++++-
 include/linux/intel-iommu.h |   2 +
 3 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index eb9868061545..d602e89c40d2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3313,6 +3313,8 @@ static int __init init_dmars(void)
 			pr_err("Failed to allocate host PASID set %lu\n",
 				PTR_ERR(host_pasid_set));
 			intel_iommu_sm = 0;
+		} else {
+			intel_svm_add_pasid_notifier();
 		}
 	}
 
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index f75699ddb923..b5bb9b578281 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -96,6 +96,104 @@ static inline bool intel_svm_capable(struct intel_iommu *iommu)
 	return iommu->flags & VTD_FLAG_SVM_CAPABLE;
 }
 
+static inline void intel_svm_drop_pasid(ioasid_t pasid)
+{
+	/*
+	 * Detaching SPID results in UNBIND notification on the set, we must
+	 * do this before dropping the IOASID reference, otherwise the
+	 * notification chain may get destroyed.
+	 */
+	ioasid_detach_spid(pasid);
+	ioasid_detach_data(pasid);
+	ioasid_put(NULL, pasid);
+}
+
+static DEFINE_MUTEX(pasid_mutex);
+#define pasid_lock_held() lock_is_held(&pasid_mutex.dep_map)
+
+static void intel_svm_free_async_fn(struct work_struct *work)
+{
+	struct intel_svm *svm = container_of(work, struct intel_svm, work);
+	struct intel_svm_dev *sdev;
+
+	/*
+	 * Unbind all devices associated with this PASID which is
+	 * being freed by other users such as VFIO.
+	 */
+	mutex_lock(&pasid_mutex);
+	list_for_each_entry_rcu(sdev, &svm->devs, list, pasid_lock_held()) {
+		/* Does not poison forward pointer */
+		list_del_rcu(&sdev->list);
+		spin_lock(&sdev->iommu->lock);
+		intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
+					svm->pasid, true);
+		intel_svm_drain_prq(sdev->dev, svm->pasid);
+		spin_unlock(&sdev->iommu->lock);
+		kfree_rcu(sdev, rcu);
+	}
+	/*
+	 * We may not be the last user to drop the reference but since
+	 * the PASID is in FREE_PENDING state, no one can get new reference.
+	 * Therefore, we can safely free the private data svm.
+	 */
+	intel_svm_drop_pasid(svm->pasid);
+
+	/*
+	 * Free before unbind can only happen with host PASIDs used for
+	 * guest SVM. We get here because ioasid_free is called with
+	 * outstanding references. So we need to drop the reference
+	 * such that the PASID can be reclaimed. unbind_gpasid() after this
+	 * will not result in dropping refcount since the private data is
+	 * already detached.
+	 */
+	kfree(svm);
+
+	mutex_unlock(&pasid_mutex);
+}
+
+
+static int pasid_status_change(struct notifier_block *nb,
+				unsigned long code, void *data)
+{
+	struct ioasid_nb_args *args = (struct ioasid_nb_args *)data;
+	struct intel_svm *svm = (struct intel_svm *)args->pdata;
+	int ret = NOTIFY_DONE;
+
+	/*
+	 * Notification private data is a choice of vendor driver when the
+	 * IOASID is allocated or attached after allocation. When the data
+	 * type changes, we must make modifications here accordingly.
+	 */
+	if (code == IOASID_NOTIFY_FREE) {
+		/*
+		 * If PASID UNBIND happens before FREE, private data of the
+		 * IOASID should be NULL, then we don't need to do anything.
+		 */
+		if (!svm)
+			goto done;
+		if (args->id != svm->pasid) {
+			pr_warn("Notify PASID does not match data %d : %d\n",
+				args->id, svm->pasid);
+			goto done;
+		}
+		if (!ioasid_queue_work(&svm->work))
+			pr_warn("Cleanup work already queued\n");
+		return NOTIFY_OK;
+	}
+done:
+	return ret;
+}
+
+static struct notifier_block pasid_nb = {
+	.notifier_call = pasid_status_change,
+};
+
+void intel_svm_add_pasid_notifier(void)
+{
+	/* Listen to all PASIDs, not specific to a set */
+	ioasid_register_notifier(NULL, &pasid_nb);
+}
+
 void intel_svm_check(struct intel_iommu *iommu)
 {
 	if (!pasid_supported(iommu))
@@ -240,7 +338,6 @@ static const struct mmu_notifier_ops intel_mmuops = {
 	.invalidate_range = intel_invalidate_range,
 };
 
-static DEFINE_MUTEX(pasid_mutex);
 static LIST_HEAD(global_svm_list);
 
 #define for_each_svm_dev(sdev, svm, d)			\
@@ -367,8 +464,16 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 		if (data->flags & IOMMU_SVA_GPASID_VAL) {
 			svm->gpasid = data->gpasid;
 			svm->flags |= SVM_FLAG_GUEST_PASID;
+			ioasid_attach_spid(data->hpasid, data->gpasid);
 		}
 		ioasid_attach_data(data->hpasid, svm);
+		ioasid_get(NULL, svm->pasid);
+		sdev->iommu = iommu;
+		/*
+		 * Set up cleanup async work in case IOASID core notify us PASID
+		 * is freed before unbind.
+		 */
+		INIT_WORK(&svm->work, intel_svm_free_async_fn);
 		INIT_LIST_HEAD_RCU(&svm->devs);
 	}
 	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
@@ -464,7 +569,7 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
 				 * the unbind, IOMMU driver will get notified
 				 * and perform cleanup.
 				 */
-				ioasid_detach_data(pasid);
+				intel_svm_drop_pasid(pasid);
 				kfree(svm);
 			}
 		}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 09c6a0bf3892..b1b8914e1564 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -757,6 +757,7 @@ void intel_svm_unbind(struct iommu_sva *handle);
 u32 intel_svm_get_pasid(struct iommu_sva *handle);
 int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
 			    struct iommu_page_response *msg);
+void intel_svm_add_pasid_notifier(void);
 
 struct svm_dev_ops;
 
@@ -783,6 +784,7 @@ struct intel_svm {
 	int gpasid; /* In case that guest PASID is different from host PASID */
 	struct list_head devs;
 	struct list_head list;
+	struct work_struct work; /* For deferred clean up */
 };
 #else
 static inline void intel_svm_check(struct intel_iommu *iommu) {}
-- 
2.25.1



* [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (13 preceding siblings ...)
  2021-02-27 22:01 ` [PATCH V4 14/18] iommu/vt-d: Listen to IOASID notifications Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-03-03 15:44   ` Tejun Heo
  2021-02-27 22:01 ` [RFC PATCH 16/18] iommu/ioasid: Consult IOASIDs cgroup for allocation Jacob Pan
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

IOASIDs are used to associate DMA requests with virtual address spaces.
They are a limited, system-wide resource made available to userspace
applications, whether VMs or user-space device drivers.

This RFC patch introduces a cgroup controller to address the following
problems:
1. Some user applications can exhaust all the available IOASIDs, thus
depriving other applications on the same host.
2. System admins need to provision VMs based on their needs for IOASIDs,
e.g. the number of VMs with assigned devices that perform DMA requests
with PASID.
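
To make the first problem concrete, here is a minimal userspace sketch of
hierarchical limit charging, in the style of other counting controllers.
It is an assumption about the general technique, not the controller in this
patch: struct cg_node, cg_try_charge() and cg_uncharge() are hypothetical
names.

```c
/* Userspace model only; every cg_* name is hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct cg_node {
	struct cg_node *parent;	/* NULL for the root */
	long limit;		/* max IOASIDs in this subtree */
	long usage;		/* IOASIDs currently charged */
};

/* Charge n IOASIDs to cg and all ancestors, or charge nothing at all. */
static int cg_try_charge(struct cg_node *cg, long n)
{
	struct cg_node *p, *q;

	assert(cg && n > 0);
	for (p = cg; p; p = p->parent) {
		if (p->usage + n > p->limit)
			goto revert;
		p->usage += n;
	}
	return 0;
revert:
	/* Undo the levels charged before the one that refused. */
	for (q = cg; q != p; q = q->parent)
		q->usage -= n;
	return -EAGAIN;
}

/* Return n IOASIDs to cg and all ancestors. */
static void cg_uncharge(struct cg_node *cg, long n)
{
	struct cg_node *p;

	for (p = cg; p; p = p->parent) {
		assert(p->usage >= n);
		p->usage -= n;
	}
}
```

With a root limit of 4 and two children limited to 3 each, one child that
takes 3 leaves only 1 for its sibling, so a single group cannot consume
the whole pool.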

This patch is far from complete; it merely provides the basic
functionality for resource distribution and cgroup hierarchy
organizational changes.

Since this is part of a greater effort to enable Shared Virtual Address
(SVA) virtualization, we would like to have a direction check and
collect feedback early. For details, please refer to the documentation:
Documentation/admin-guide/cgroup-v1/ioasids.rst

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/linux/cgroup_subsys.h |   4 +
 include/linux/ioasid.h        |  17 ++
 init/Kconfig                  |   7 +
 kernel/cgroup/Makefile        |   1 +
 kernel/cgroup/ioasids.c       | 345 ++++++++++++++++++++++++++++++++++
 5 files changed, 374 insertions(+)
 create mode 100644 kernel/cgroup/ioasids.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..cda75ecdcdcb 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -57,6 +57,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_IOASIDS)
+SUBSYS(ioasids)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_RDMA)
 SUBSYS(rdma)
 #endif
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 4547086797df..5ea4710efb02 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -135,8 +135,25 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
 				void *data);
 int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
 void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+#ifdef CONFIG_CGROUP_IOASIDS
+int ioasid_cg_charge(struct ioasid_set *set);
+void ioasid_cg_uncharge(struct ioasid_set *set);
+#else
+/* No cgroup control; allocation proceeds until the total pool runs out */
+static inline int ioasid_cg_charge(struct ioasid_set *set)
+{
+	return 0;
+}
+
+static inline void ioasid_cg_uncharge(struct ioasid_set *set)
+{
+	return;
+}
+#endif /* CGROUP_IOASIDS */
 bool ioasid_queue_work(struct work_struct *work);
+
 #else /* !CONFIG_IOASID */
+
 static inline void ioasid_install_capacity(ioasid_t total)
 {
 }
diff --git a/init/Kconfig b/init/Kconfig
index b77c60f8b963..9a23683dad98 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1017,6 +1017,13 @@ config CGROUP_PIDS
 	  since the PIDs limit only affects a process's ability to fork, not to
 	  attach to a cgroup.
 
+config CGROUP_IOASIDS
+	bool "IOASIDs controller"
+	depends on IOASID
+	help
+	  Provides enforcement of IO Address Space ID limits in the scope of a
+	  cgroup.
+
 config CGROUP_RDMA
 	bool "RDMA controller"
 	help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 5d7a76bfbbb7..c5ad7c9a2305 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -3,6 +3,7 @@ obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o freezer.o
 
 obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
+obj-$(CONFIG_CGROUP_IOASIDS) += ioasids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/ioasids.c b/kernel/cgroup/ioasids.c
new file mode 100644
index 000000000000..ac43813da6ad
--- /dev/null
+++ b/kernel/cgroup/ioasids.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * IO Address Space ID limiting controller for cgroups.
+ *
+ */
+#define pr_fmt(fmt)	"ioasids_cg: " fmt
+
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/ioasid.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
+
+#define IOASIDS_MAX_STR "max"
+static DEFINE_MUTEX(ioasids_cg_lock);
+
+struct ioasids_cgroup {
+	struct cgroup_subsys_state	css;
+	atomic64_t			counter;
+	atomic64_t			limit;
+	struct cgroup_file		events_file;
+	/* Number of times allocations failed because limit was hit. */
+	atomic64_t			events_limit;
+};
+
+static struct ioasids_cgroup *css_ioasids(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct ioasids_cgroup, css);
+}
+
+static struct ioasids_cgroup *parent_ioasids(struct ioasids_cgroup *ioasids)
+{
+	return css_ioasids(ioasids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+ioasids_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct ioasids_cgroup *ioasids;
+
+	ioasids = kzalloc(sizeof(struct ioasids_cgroup), GFP_KERNEL);
+	if (!ioasids)
+		return ERR_PTR(-ENOMEM);
+
+	atomic64_set(&ioasids->counter, 0);
+	atomic64_set(&ioasids->limit, 0);
+	atomic64_set(&ioasids->events_limit, 0);
+	return &ioasids->css;
+}
+
+static void ioasids_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_ioasids(css));
+}
+
+/**
+ * ioasids_cancel - uncharge the local IOASID count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to cancel
+ *
+ */
+static void ioasids_cancel(struct ioasids_cgroup *ioasids, int num)
+{
+	WARN_ON_ONCE(atomic64_add_negative(-num, &ioasids->counter));
+}
+
+/**
+ * ioasids_uncharge - hierarchically uncharge the ioasid count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to uncharge
+ */
+static void ioasids_uncharge(struct ioasids_cgroup *ioasids, int num)
+{
+	struct ioasids_cgroup *p;
+
+	for (p = ioasids; parent_ioasids(p); p = parent_ioasids(p))
+		ioasids_cancel(p, num);
+}
+
+/**
+ * ioasids_charge - hierarchically charge the ioasid count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to charge
+ */
+static void ioasids_charge(struct ioasids_cgroup *ioasids, int num)
+{
+	struct ioasids_cgroup *p;
+
+	for (p = ioasids; parent_ioasids(p); p = parent_ioasids(p))
+		atomic64_add(num, &p->counter);
+}
+
+/**
+ * ioasids_try_charge - hierarchically try to charge the ioasid count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to charge
+ */
+static int ioasids_try_charge(struct ioasids_cgroup *ioasids, int num)
+{
+	struct ioasids_cgroup *p, *q;
+
+	for (p = ioasids; parent_ioasids(p); p = parent_ioasids(p)) {
+		int64_t new = atomic64_add_return(num, &p->counter);
+		int64_t limit = atomic64_read(&p->limit);
+
+		if (new > limit)
+			goto revert;
+	}
+
+	return 0;
+
+revert:
+	for (q = ioasids; q != p; q = parent_ioasids(q))
+		ioasids_cancel(q, num);
+	ioasids_cancel(p, num);
+	cgroup_file_notify(&ioasids->events_file);
+
+	return -EAGAIN;
+}
+
+
+/**
+ * ioasid_cg_charge - Check and charge IOASIDs cgroup
+ *
+ * @set: IOASID set used for allocation
+ *
+ * The IOASID quota is managed per cgroup, all process based allocations
+ * must be validated per cgroup hierarchy.
+ * Return 0 if a single IOASID can be allocated or error if failed in various
+ * checks.
+ */
+int ioasid_cg_charge(struct ioasid_set *set)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	struct cgroup_subsys_state *css;
+	struct ioasids_cgroup *ioasids;
+	int ret = 0;
+
+	/* We only charge user process allocated PASIDs */
+	if (set->type != IOASID_SET_TYPE_MM) {
+		/* Drop the reference taken by get_task_mm() above */
+		if (mm)
+			mmput_async(mm);
+		return ret;
+	}
+	/* Must be called with a valid mm, not during process exit */
+	if (!mm)
+		return -EINVAL;
+	if (set->token != mm) {
+		pr_err("No permission to allocate IOASID\n");
+		ret = -EPERM;
+		goto exit_drop;
+	}
+	rcu_read_lock();
+	css = task_css(current, ioasids_cgrp_id);
+	ioasids = css_ioasids(css);
+	rcu_read_unlock();
+	ret = ioasids_try_charge(ioasids, 1);
+	if (ret)
+		pr_warn("%s: Unable to charge IOASID %d\n", __func__, ret);
+exit_drop:
+	mmput_async(mm);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_cg_charge);
+
+/* Uncharge IOASIDs cgroup after freeing an IOASID */
+void ioasid_cg_uncharge(struct ioasid_set *set)
+{
+	struct cgroup_subsys_state *css;
+	struct ioasids_cgroup *ioasids;
+	struct mm_struct *mm;
+
+	/* We only charge user process allocated PASIDs */
+	if (set->type != IOASID_SET_TYPE_MM)
+		return;
+	mm = set->token;
+	if (!mmget_not_zero(mm)) {
+		pr_err("MM defunct! Cannot uncharge IOASID\n");
+		return;
+	}
+	rcu_read_lock();
+	css = task_css(current, ioasids_cgrp_id);
+	ioasids = css_ioasids(css);
+	rcu_read_unlock();
+	ioasids_uncharge(ioasids, 1);
+	mmput_async(mm);
+}
+EXPORT_SYMBOL_GPL(ioasid_cg_uncharge);
+
+static int ioasids_can_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css;
+	struct ioasid_set *set;
+	struct task_struct *leader;
+
+	/*
+	 * IOASIDs are managed at per process level, we only support domain mode
+	 * in task management model. Loop through all processes by each thread
+	 * leader, charge the leader's css.
+	 */
+	cgroup_taskset_for_each_leader(leader, dst_css, tset) {
+		struct ioasids_cgroup *ioasids = css_ioasids(dst_css);
+		struct cgroup_subsys_state *old_css;
+		struct ioasids_cgroup *old_ioasids;
+		struct mm_struct *mm = get_task_mm(leader);
+
+		set = ioasid_find_mm_set(mm);
+		mmput(mm);
+		if (!set)
+			continue;
+
+		old_css = task_css(leader, ioasids_cgrp_id);
+		old_ioasids = css_ioasids(old_css);
+
+		ioasids_charge(ioasids, atomic_read(&set->nr_ioasids));
+		ioasids_uncharge(old_ioasids, atomic_read(&set->nr_ioasids));
+	}
+
+	return 0;
+}
+
+static void ioasids_cancel_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct ioasids_cgroup *ioasids = css_ioasids(dst_css);
+		struct cgroup_subsys_state *old_css;
+		struct ioasids_cgroup *old_ioasids;
+
+		old_css = task_css(task, ioasids_cgrp_id);
+		old_ioasids = css_ioasids(old_css);
+
+		ioasids_charge(old_ioasids, 1);
+		ioasids_uncharge(ioasids, 1);
+	}
+}
+
+static ssize_t ioasids_max_write(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct ioasids_cgroup *ioasids = css_ioasids(css);
+	int64_t limit, limit_cur;
+	int err;
+
+	mutex_lock(&ioasids_cg_lock);
+	/* Check whether we are growing or shrinking */
+	limit_cur = atomic64_read(&ioasids->limit);
+	buf = strstrip(buf);
+	if (!strcmp(buf, IOASIDS_MAX_STR)) {
+		/* Returns how many IOASIDs were in the pool */
+		limit = ioasid_reserve_capacity(0);
+		ioasid_reserve_capacity(limit - limit_cur);
+		goto set_limit;
+	}
+	err = kstrtoll(buf, 0, &limit);
+	if (err)
+		goto done_unlock;
+
+	err = nbytes;
+	/* Check whether we are growing or shrinking */
+	limit_cur = atomic64_read(&ioasids->limit);
+	if (limit < 0 || limit == limit_cur) {
+		err = -EINVAL;
+		goto done_unlock;
+	}
+	if (limit < limit_cur)
+		err = ioasid_cancel_capacity(limit_cur - limit);
+	else
+		err = ioasid_reserve_capacity(limit - limit_cur);
+	if (err < 0)
+		goto done_unlock;
+
+set_limit:
+	err = nbytes;
+	atomic64_set(&ioasids->limit, limit);
+done_unlock:
+	mutex_unlock(&ioasids_cg_lock);
+	return err;
+}
+
+static int ioasids_max_show(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(sf);
+	struct ioasids_cgroup *ioasids = css_ioasids(css);
+	int64_t limit = atomic64_read(&ioasids->limit);
+
+	seq_printf(sf, "%lld\n", limit);
+
+	return 0;
+}
+
+static s64 ioasids_current_read(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct ioasids_cgroup *ioasids = css_ioasids(css);
+
+	return atomic64_read(&ioasids->counter);
+}
+
+static int ioasids_events_show(struct seq_file *sf, void *v)
+{
+	struct ioasids_cgroup *ioasids = css_ioasids(seq_css(sf));
+
+	seq_printf(sf, "max %lld\n", (s64)atomic64_read(&ioasids->events_limit));
+	return 0;
+}
+
+static struct cftype ioasids_files[] = {
+	{
+		.name = "max",
+		.write = ioasids_max_write,
+		.seq_show = ioasids_max_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.read_s64 = ioasids_current_read,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "events",
+		.seq_show = ioasids_events_show,
+		.file_offset = offsetof(struct ioasids_cgroup, events_file),
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys ioasids_cgrp_subsys = {
+	.css_alloc	= ioasids_css_alloc,
+	.css_free	= ioasids_css_free,
+	.can_attach	= ioasids_can_attach,
+	.cancel_attach	= ioasids_cancel_attach,
+	.legacy_cftypes	= ioasids_files,
+	.dfl_cftypes	= ioasids_files,
+	.threaded	= false,
+};
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 269+ messages in thread

* [RFC PATCH 16/18] iommu/ioasid: Consult IOASIDs cgroup for allocation
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (14 preceding siblings ...)
  2021-02-27 22:01 ` [RFC PATCH 15/18] cgroup: Introduce ioasids controller Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [RFC PATCH 17/18] docs: cgroup-v1: Add IOASIDs controller Jacob Pan
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

Once the IOASIDs cgroup is active, we must consult the limits set up
by the cgroups during allocation. Freeing IOASIDs also needs to return
the quota back to the cgroup.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
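As a rough illustration of the pairing described above (charge before
allocating, give the charge back if allocation fails, uncharge on free),
here is a small userspace model. It is a sketch under stated assumptions,
not the ioasid core: struct pool, POOL_SIZE and the pool_* helpers are
hypothetical, and a flat counter replaces the cgroup hierarchy.

```c
#include <stdbool.h>

#define POOL_SIZE	4	/* hypothetical tiny ID space */
#define INVALID_ID	(-1)

struct pool {
	bool used[POOL_SIZE];
	int charged;	/* IDs charged against the quota */
	int quota;	/* stand-in for the cgroup limit */
};

static int pool_charge(struct pool *p)
{
	if (p->charged >= p->quota)
		return -1;
	p->charged++;
	return 0;
}

static void pool_uncharge(struct pool *p)
{
	p->charged--;
}

/* Charge first, then allocate; return the charge if allocation fails. */
static int pool_alloc(struct pool *p, int min, int max)
{
	int id;

	if (pool_charge(p))
		return INVALID_ID;

	for (id = min; id <= max && id < POOL_SIZE; id++) {
		if (id >= 0 && !p->used[id]) {
			p->used[id] = true;
			return id;
		}
	}

	pool_uncharge(p);	/* plays the role of the exit_free path */
	return INVALID_ID;
}

/* Freeing an ID returns its quota. */
static void pool_free(struct pool *p, int id)
{
	if (id < 0 || id >= POOL_SIZE || !p->used[id])
		return;
	p->used[id] = false;
	pool_uncharge(p);
}
```

The point of the ordering is that a failed allocation never leaves a
stale charge behind, so the charged count always equals the number of
IDs actually in use.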
 drivers/iommu/ioasid.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index d42b39ca2c8b..fd3f5729c71d 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -782,7 +782,10 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 
 	spin_lock(&ioasid_allocator_lock);
 	/* Check if the IOASID set has been allocated and initialized */
-	if (!ioasid_set_is_valid(set))
+	if (!set || !ioasid_set_is_valid(set))
+		goto done_unlock;
+
+	if (set->type == IOASID_SET_TYPE_MM && ioasid_cg_charge(set))
 		goto done_unlock;
 
 	if (set->quota <= atomic_read(&set->nr_ioasids)) {
@@ -832,6 +835,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 	goto done_unlock;
 exit_free:
 	kfree(data);
+	ioasid_cg_uncharge(set);
 done_unlock:
 	spin_unlock(&ioasid_allocator_lock);
 	return id;
@@ -849,6 +853,7 @@ static void ioasid_do_free_locked(struct ioasid_data *data)
 		kfree_rcu(ioasid_data, rcu);
 	}
 	atomic_dec(&data->set->nr_ioasids);
+	ioasid_cg_uncharge(data->set);
 	xa_erase(&data->set->xa, data->id);
 	/* Destroy the set if empty */
 	if (!atomic_read(&data->set->nr_ioasids))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 269+ messages in thread

* [RFC PATCH 17/18] docs: cgroup-v1: Add IOASIDs controller
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (15 preceding siblings ...)
  2021-02-27 22:01 ` [RFC PATCH 16/18] iommu/ioasid: Consult IOASIDs cgroup for allocation Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-02-27 22:01 ` [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace Jacob Pan
  2021-03-02 12:58 ` [PATCH V4 00/18] IOASID extensions for guest SVA Liu, Yi L
  18 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang, Jacob Pan

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
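The documentation added here states that moving a process between
cgroups is exempt from the limit check, so ioasids.current can exceed
ioasids.max. A minimal userspace model of that rule, with hypothetical
names (struct cg_model, cg_model_alloc, cg_model_migrate) and no
hierarchy, might look like this:

```c
/* Hypothetical model of one cgroup's ioasids.current and ioasids.max. */
struct cg_model {
	long current;	/* models ioasids.current */
	long max;	/* models ioasids.max */
};

/* New allocations are checked against the limit. */
static int cg_model_alloc(struct cg_model *cg)
{
	if (cg->current + 1 > cg->max)
		return -1;
	cg->current++;
	return 0;
}

/* Moving a process transfers its charges without any limit check. */
static void cg_model_migrate(struct cg_model *from, struct cg_model *to,
			     long nr_ioasids)
{
	to->current += nr_ioasids;
	from->current -= nr_ioasids;
}
```

Migrating a process holding one IOASID into a cgroup that is already at
current = max = 5 leaves it at current = 6, after which further
allocations in that cgroup are rejected.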
 Documentation/admin-guide/cgroup-v1/index.rst |   1 +
 .../admin-guide/cgroup-v1/ioasids.rst         | 110 ++++++++++++++++++
 2 files changed, 111 insertions(+)
 create mode 100644 Documentation/admin-guide/cgroup-v1/ioasids.rst

diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
index 226f64473e8e..f5e307dc4dbb 100644
--- a/Documentation/admin-guide/cgroup-v1/index.rst
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -15,6 +15,7 @@ Control Groups version 1
     devices
     freezer-subsystem
     hugetlb
+    ioasids
     memcg_test
     memory
     net_cls
diff --git a/Documentation/admin-guide/cgroup-v1/ioasids.rst b/Documentation/admin-guide/cgroup-v1/ioasids.rst
new file mode 100644
index 000000000000..b30eb41bf1be
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/ioasids.rst
@@ -0,0 +1,110 @@
+========================================
+I/O Address Space ID (IOASID) Controller
+========================================
+
+Acronyms
+--------
+PASID:
+	Process Address Space ID, defined by PCIe
+SVA:
+	Shared Virtual Address
+
+Introduction
+------------
+
+IOASIDs are used to associate DMA requests with virtual address spaces. As
+a system-wide limited¹ resource, its constraints are managed by the IOASIDs
+cgroup subsystem. The specific use cases are:
+
+1. Some user applications may exhaust all the available IOASIDs, thus
+   depriving others on the same host.
+
+2. System admins need to provision VMs based on their needs for IOASIDs,
+   e.g. the number of VMs with assigned devices that perform DMA requests
+   with PASID.
+
+The IOASID subsystem consists of three components:
+
+- IOASID core: provides APIs for allocation, pool management,
+  notifications and refcounting. See Documentation/driver-api/ioasid.rst
+  for details
+- IOASID user: provides user allocation interface via /dev/ioasid
+- IOASID cgroup controller: manages resource distribution
+
+Resource Distribution Model
+---------------------------
+IOASID allocation is process-based in that IOASIDs are tied to page tables²;
+the threaded model is not supported. The allocation is rejected by the
+cgroup hierarchy once a limit is reached. However, organizational changes
+such as moving processes across cgroups are exempted. Therefore, it is
+possible to have ioasids.current > ioasids.max. No further allocation is
+possible after an organizational change that exceeds the max.
+
+The system capacity of IOASIDs defaults to the PCIe PASID size of 20 bits.
+The IOASID core provides an API to adjust the system capacity per platform.
+IOASIDs are used by both user applications (e.g. VMs and userspace drivers)
+and the kernel (e.g. supervisor SVA). However, only user allocation is
+subject to cgroup constraints. The host kernel allocates a pool of IOASIDs
+whose quota is subtracted from the system capacity. The IOASIDs cgroup
+consults the IOASID core for available capacity when a new cgroup limit is
+granted. Upon creation, a new cgroup has a limit of 0, so user processes
+within it cannot allocate any IOASID.
+
+Usage
+-----
+CGroup filesystem has the following IOASIDs controller specific entries:
+::
+
+ ioasids.current
+ ioasids.events
+ ioasids.max
+
+To use the IOASIDs controller, set ioasids.max to the limit of the number
+of IOASIDs that can be allocated. The file ioasids.current shows the current
+number of IOASIDs allocated within the cgroup.
+
+Example
+--------
+1. Mount the cgroup2 FS ::
+
+	$ mount -t cgroup2 none /mnt/cg2/
+
+2. Add ioasids controller ::
+
+	$ echo '+ioasids' > /mnt/cg2/cgroup.subtree_control
+
+3. Create a hierarchy, set non-zero limit (default 0) ::
+
+	$ mkdir /mnt/cg2/test1
+	$ echo 5 > /mnt/cg2/test1/ioasids.max
+
+4. Allocating IOASIDs within the limit should succeed ::
+
+	$ echo $$ > /mnt/cg2/test1/cgroup.procs
+	Do IOASID allocation via /dev/ioasid
+	ioasids.current:1
+	ioasids.max:5
+
+5. Attempting to allocate IOASIDs beyond the limit should fail ::
+
+	ioasids.current:5
+	ioasids.max:5
+
+6. Attaching a process that already has IOASIDs allocated to a cgroup can
+result in ioasids.current > ioasids.max. For example, if the process with
+PID 1234, under a cgroup with the IOASIDs controller, has one IOASID
+allocated, moving it to the test1 cgroup gives ::
+
+	$ echo 1234 > /mnt/cg2/test1/cgroup.procs
+	ioasids.current:6
+	ioasids.max:5
+
+Notes
+-----
+¹ When IOASID is used for PCI Express PASID, the range is limited to the
+PASID size of 20 bits. For a device whose resources can be shared across
+the platform, the IOASID namespace must be system-wide in order to uniquely
+identify DMA requests with PASID inside the device.
+
+² The primary use case is SVA, where CPU page tables are shared with DMA via
+IOMMU.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 269+ messages in thread

* [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (16 preceding siblings ...)
  2021-02-27 22:01 ` [RFC PATCH 17/18] docs: cgroup-v1: Add IOASIDs controller Jacob Pan
@ 2021-02-27 22:01 ` Jacob Pan
  2021-03-10 19:23   ` Jason Gunthorpe
  2021-03-02 12:58 ` [PATCH V4 00/18] IOASID extensions for guest SVA Liu, Yi L
  18 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-02-27 22:01 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang

From: Liu Yi L <yi.l.liu@intel.com>

I/O Address Space IDs (IOASIDs) are used to tag DMA requests to target
multiple DMA address spaces for physical devices. In PCI terminology they
are called PASIDs (Process Address Space IDs). Platforms with PASID support
can provide PASID granularity DMA isolation, which is very useful for
efficient and secure device sharing (SVA, subdevice passthrough, etc.).

Today only kernel drivers are allowed to allocate IOASIDs [1]. This patch
aims to extend this capability to userspace as required in device
passthrough scenarios. For example, a userspace driver may want to create its
own DMA address spaces besides the default IOVA address space established
by the kernel on the assigned device (e.g. vDPA control vq [2] and guest
SVA [3]), and thus needs to get IOASIDs from the kernel IOASID allocator for
tagging. In concept, each device can have its own IOASID space, so it is
also possible for a userspace driver to manage a private IOASID space itself,
say, when a PF/VF is assigned. However, this doesn't work for subdevice
passthrough, as multiple subdevices under the same parent device share a
single IOASID space, so IOASIDs must be centrally managed by the kernel in
that case.

This patch introduces a /dev/ioasid interface for this purpose (per discussion
in [4]). An IOASID is just a number before it is tagged to a specific DMA
address space. The actual IOASID tagging (to DMA requests) and association
(with DMA address spaces) operations from userspace are scrutinized by specific
device passthrough frameworks, which must ensure that a malicious driver
cannot program arbitrary IOASIDs to its assigned device to access DMA address
spaces that don't belong to it. That is out of the scope of this patch (a
reference VFIO implementation will be posted soon).

Open:

PCIe PASID is 20 bits wide, implying a space of 1M IOASIDs. Although that is
plenty, there was an open question [4] on whether this user interface is open
to all processes or only to selected processes (e.g. with a device assigned).
In this patch series, a cgroup controller is introduced to manage the IOASID
quota that a process is allowed to use. A cgroup-enabled system may by default
set quota=0 to disallow IOASID allocation for most processes, and then have
the virt management stack adjust the quota for a process which gets a device
assigned. We are also willing to hear more suggestions.

[1] https://lore.kernel.org/linux-iommu/1565900005-62508-8-git-send-email-jacob.jun.pan@linux.intel.com/
[2] https://lore.kernel.org/kvm/20201216064818.48239-1-jasowang@redhat.com/
[3] https://lore.kernel.org/linux-iommu/1599734733-6431-1-git-send-email-yi.l.liu@intel.com/
[4] https://lore.kernel.org/kvm/20201014171055.328a52f4@w520.home/

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
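For reference, the range check that ioasid_alloc_request() applies to a
userspace request can be modeled standalone as below. This is only a
sketch: ioasid_range_ok() is a hypothetical helper, and it assumes the
min/max fields of the request are 32-bit unsigned values. The bound is
computed with an unsigned constant because shifting a signed 1 left by
31 bits would overflow int.

```c
#include <stdbool.h>
#include <stdint.h>

/* The user IOASID uAPI in this patch supports 31 bits. */
#define IOASID_BITS	31

/*
 * Return true if [min, max] is a well-formed range that fits in
 * IOASID_BITS bits, false otherwise.
 */
static bool ioasid_range_ok(uint32_t min, uint32_t max)
{
	const uint32_t bound = UINT32_C(1) << IOASID_BITS;

	if (min > max)
		return false;
	if (min >= bound || max >= bound)
		return false;
	return true;
}
```

A request that fails this check is rejected with -EINVAL before the
IOASID core is consulted.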
 Documentation/userspace-api/index.rst  |   1 +
 Documentation/userspace-api/ioasid.rst |  49 ++++
 drivers/iommu/Kconfig                  |   5 +
 drivers/iommu/Makefile                 |   1 +
 drivers/iommu/intel/Kconfig            |   1 +
 drivers/iommu/ioasid_user.c            | 297 +++++++++++++++++++++++++
 include/linux/ioasid.h                 |  26 +++
 include/linux/miscdevice.h             |   1 +
 include/uapi/linux/ioasid.h            |  98 ++++++++
 9 files changed, 479 insertions(+)
 create mode 100644 Documentation/userspace-api/ioasid.rst
 create mode 100644 drivers/iommu/ioasid_user.c
 create mode 100644 include/uapi/linux/ioasid.h

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index acd2cc2a538d..69e1be7c67ee 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -24,6 +24,7 @@ place where this information is gathered.
    ioctl/index
    iommu
    media/index
+   ioasid
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/ioasid.rst b/Documentation/userspace-api/ioasid.rst
new file mode 100644
index 000000000000..879d6cbae858
--- /dev/null
+++ b/Documentation/userspace-api/ioasid.rst
@@ -0,0 +1,49 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _ioasid:
+
+=====================================
+IOASID Userspace API
+=====================================
+
+The IOASID UAPI is used for userspace IOASID allocation/free requests,
+thus IOASID management is centralized in the IOASID core[1] in the kernel. The
+primary use case is guest Shared Virtual Address (SVA) today.
+
+Requests such as allocation/free can be issued by the users and managed
+on a per-process basis through the ioasid core. Upon opening ("/dev/ioasid"),
+a process obtains a unique handle associated with the process's mm_struct.
+This handle is mapped to an FD in the userspace. Only a single open is
+allowed per process.
+
+File descriptors can be transferred across processes by employing fork() or
+UNIX domain socket. FDs obtained by transfer cannot be used to perform
+IOASID requests. The following behaviors are recommended for the
+applications:
+
+ - forked children close the parent's IOASID FDs immediately, open new
+   /dev/ioasid FDs if IOASID allocation is desired
+
+ - do not share FDs via UNIX domain socket, e.g. via sendmsg
+
+================
+Userspace APIs
+================
+
+/dev/ioasid provides below ioctls:
+
+*) IOASID_GET_API_VERSION: returns the API version. Userspace should first
+   check it against the version it was built with.
+*) IOASID_GET_INFO: returns the information on the /dev/ioasid.
+   - ioasid_bits: the ioasid bit width supported by this uAPI. Userspace
+     should compare the ioasid_bits returned by this ioctl with the ioasid
+     bits it wants and should fail if the returned value is smaller;
+     otherwise, allocation will fail.
+*) IOASID_REQUEST_ALLOC: returns an IOASID which is allocated in kernel within
+   the specified ioasid range.
+*) IOASID_REQUEST_FREE: frees an IOASID per userspace's request.
+
+For detailed definition, please see include/uapi/linux/ioasid.h.
+
+.. contents:: :local:
+
+[1] Documentation/driver-api/ioasid.rst
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 192ef8f61310..830f4ec28a16 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -7,6 +7,11 @@ config IOMMU_IOVA
 config IOASID
 	tristate
 
+config IOASID_USER
+	tristate
+	depends on IOASID
+	default n
+
 # IOMMU_API always gets selected by whoever wants it.
 config IOMMU_API
 	bool
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 61bd30cd8369..305dd019ff49 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_IOASID) += ioasid.o
+obj-$(CONFIG_IOASID_USER) += ioasid_user.o
 obj-$(CONFIG_IOMMU_IOVA) += iova.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 28a3d1596c76..a6d9dea61d58 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -13,6 +13,7 @@ config INTEL_IOMMU
 	select DMAR_TABLE
 	select SWIOTLB
 	select IOASID
+	select IOASID_USER
 	select IOMMU_DMA
 	help
 	  DMA remapping (DMAR) devices support enables independent address
diff --git a/drivers/iommu/ioasid_user.c b/drivers/iommu/ioasid_user.c
new file mode 100644
index 000000000000..2f8957cd055a
--- /dev/null
+++ b/drivers/iommu/ioasid_user.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support IOASID allocation/free from user space.
+ *
+ * Copyright (C) 2021 Intel Corporation.
+ *     Author: Liu Yi L <yi.l.liu@intel.com>
+ *
+ */
+
+#include <linux/ioasid.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/sched/mm.h>
+#include <linux/miscdevice.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
+#define DRIVER_DESC     "IOASID management for user space"
+
+/* Current user ioasid uapi supports 31 bits */
+#define IOASID_BITS	31
+
+struct ioasid_user_token {
+	unsigned long long val;
+};
+
+struct ioasid_user {
+	struct kref		kref;
+	struct ioasid_set	*ioasid_set;
+	struct mutex		lock;
+	struct list_head	next;
+	struct ioasid_user_token	token;
+};
+
+static struct mutex		ioasid_user_lock;
+static struct list_head		ioasid_user_list;
+
+/* called with ioasid_user_lock held */
+static void ioasid_user_release(struct kref *kref)
+{
+	struct ioasid_user *iuser = container_of(kref, struct ioasid_user, kref);
+
+	ioasid_free_all_in_set(iuser->ioasid_set);
+	list_del(&iuser->next);
+	mutex_unlock(&ioasid_user_lock);
+	ioasid_set_free(iuser->ioasid_set);
+	kfree(iuser);
+}
+
+void ioasid_user_put(struct ioasid_user *iuser)
+{
+	kref_put_mutex(&iuser->kref, ioasid_user_release, &ioasid_user_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_user_put);
+
+static void ioasid_user_get(struct ioasid_user *iuser)
+{
+	kref_get(&iuser->kref);
+}
+
+struct ioasid_user *ioasid_user_get_from_task(struct task_struct *task)
+{
+	struct mm_struct *mm = get_task_mm(task);
+	unsigned long long val = (unsigned long long)mm;
+	struct ioasid_user *iuser;
+	bool found = false;
+
+	if (!mm)
+		return NULL;
+
+	mutex_lock(&ioasid_user_lock);
+	/* Search existing ioasid_user with current mm pointer */
+	list_for_each_entry(iuser, &ioasid_user_list, next) {
+		if (iuser->token.val == val) {
+			ioasid_user_get(iuser);
+			found = true;
+			break;
+		}
+	}
+
+	mmput(mm);
+
+	mutex_unlock(&ioasid_user_lock);
+	return found ? iuser : NULL;
+}
+EXPORT_SYMBOL_GPL(ioasid_user_get_from_task);
+
+void ioasid_user_for_each_id(struct ioasid_user *iuser, void *data,
+			    void (*fn)(ioasid_t id, void *data))
+{
+	mutex_lock(&iuser->lock);
+	ioasid_set_for_each_ioasid(iuser->ioasid_set, fn, data);
+	mutex_unlock(&iuser->lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_user_for_each_id);
+
+static int ioasid_fops_open(struct inode *inode, struct file *filep)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	unsigned long long val = (unsigned long long)mm;
+	struct ioasid_set *iset;
+	struct ioasid_user *iuser;
+	int ret = 0;
+
+	mutex_lock(&ioasid_user_lock);
+	/* Only allow one single open per process */
+	list_for_each_entry(iuser, &ioasid_user_list, next) {
+		if (iuser->token.val == val) {
+			ret = -EBUSY;
+			goto out;
+		}
+	}
+
+	iuser = kzalloc(sizeof(*iuser), GFP_KERNEL);
+	if (!iuser) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * IOASID core provides a 'IOASID set' concept to track all
+	 * IOASIDs associated with a token. Here we use mm_struct as
+	 * the token and create a IOASID set per mm_struct. All the
+	 * containers of the process share the same IOASID set.
+	 */
+	iset = ioasid_set_alloc(mm, 0, IOASID_SET_TYPE_MM);
+	if (IS_ERR(iset)) {
+		kfree(iuser);
+		ret = PTR_ERR(iset);
+		goto out;
+	}
+
+	iuser->ioasid_set = iset;
+	kref_init(&iuser->kref);
+	iuser->token.val = val;
+	mutex_init(&iuser->lock);
+	filep->private_data = iuser;
+
+	list_add(&iuser->next, &ioasid_user_list);
+out:
+	mutex_unlock(&ioasid_user_lock);
+	mmput(mm);
+	return ret;
+}
+
+static int ioasid_fops_release(struct inode *inode, struct file *filep)
+{
+	struct ioasid_user *iuser = filep->private_data;
+
+	filep->private_data = NULL;
+
+	ioasid_user_put(iuser);
+
+	return 0;
+}
+
+static int ioasid_get_info(struct ioasid_user *iuser, unsigned long arg)
+{
+	struct ioasid_info info;
+	unsigned long minsz;
+
+	minsz = offsetofend(struct ioasid_info, ioasid_bits);
+
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz || info.flags)
+		return -EINVAL;
+
+	info.ioasid_bits = IOASID_BITS;
+
+	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
+static int ioasid_alloc_request(struct ioasid_user *iuser, unsigned long arg)
+{
+	struct ioasid_alloc_request req;
+	unsigned long minsz;
+	ioasid_t ioasid;
+
+	minsz = offsetofend(struct ioasid_alloc_request, range);
+
+	if (copy_from_user(&req, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (req.argsz < minsz || req.flags)
+		return -EINVAL;
+
+	if (req.range.min > req.range.max ||
+	    req.range.min >= (1 << IOASID_BITS) ||
+	    req.range.max >= (1 << IOASID_BITS))
+		return -EINVAL;
+
+	ioasid = ioasid_alloc(iuser->ioasid_set, req.range.min,
+			    req.range.max, NULL);
+
+	if (ioasid == INVALID_IOASID)
+		return -EINVAL;
+
+	return ioasid;
+
+}
+
+static int ioasid_free_request(struct ioasid_user *iuser, unsigned long arg)
+{
+	int ioasid;
+
+	if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
+		return -EFAULT;
+
+	if (ioasid < 0)
+		return -EINVAL;
+
+	ioasid_free(iuser->ioasid_set, ioasid);
+
+	return 0;
+}
+
+static long ioasid_fops_unl_ioctl(struct file *filep,
+				  unsigned int cmd, unsigned long arg)
+{
+	struct ioasid_user *iuser = filep->private_data;
+	long ret = -EINVAL;
+
+	if (!iuser)
+		return ret;
+
+	mutex_lock(&iuser->lock);
+
+	switch (cmd) {
+	case IOASID_GET_API_VERSION:
+		ret = IOASID_API_VERSION;
+		break;
+	case IOASID_GET_INFO:
+		ret = ioasid_get_info(iuser, arg);
+		break;
+	case IOASID_REQUEST_ALLOC:
+		ret = ioasid_alloc_request(iuser, arg);
+		break;
+	case IOASID_REQUEST_FREE:
+		ret = ioasid_free_request(iuser, arg);
+		break;
+	default:
+		pr_err("Unsupported cmd %u\n", cmd);
+		break;
+	}
+
+	mutex_unlock(&iuser->lock);
+	return ret;
+}
+
+static const struct file_operations ioasid_user_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ioasid_fops_open,
+	.release	= ioasid_fops_release,
+	.unlocked_ioctl	= ioasid_fops_unl_ioctl,
+};
+
+static struct miscdevice ioasid_user = {
+	.minor = IOASID_MINOR,
+	.name = "ioasid_user",
+	.fops = &ioasid_user_fops,
+	.nodename = "ioasid",
+	.mode = S_IRUGO | S_IWUGO,
+};
+
+
+static int __init ioasid_user_init(void)
+{
+	int ret;
+
+	ret = misc_register(&ioasid_user);
+	if (ret) {
+		pr_err("ioasid_user: misc device register failed\n");
+		return ret;
+	}
+
+	mutex_init(&ioasid_user_lock);
+	INIT_LIST_HEAD(&ioasid_user_list);
+	return 0;
+}
+
+static void __exit ioasid_user_exit(void)
+{
+	WARN_ON(!list_empty(&ioasid_user_list));
+	misc_deregister(&ioasid_user);
+}
+
+module_init(ioasid_user_init);
+module_exit(ioasid_user_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 5ea4710efb02..b82abe6325f7 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -6,6 +6,7 @@
 #include <linux/errno.h>
 #include <linux/xarray.h>
 #include <linux/refcount.h>
+#include <uapi/linux/ioasid.h>
 
 #define INVALID_IOASID ((ioasid_t)-1)
 typedef unsigned int ioasid_t;
@@ -152,6 +153,31 @@ static inline int ioasid_cg_uncharge(struct ioasid_set *set)
 #endif /* CGROUP_IOASIDS */
 bool ioasid_queue_work(struct work_struct *work);
 
+/* IOASID userspace support */
+struct ioasid_user;
+#if IS_ENABLED(CONFIG_IOASID_USER)
+extern struct ioasid_user *ioasid_user_get_from_task(struct task_struct *task);
+extern void ioasid_user_put(struct ioasid_user *iuser);
+extern void ioasid_user_for_each_id(struct ioasid_user *iuser, void *data,
+				   void (*fn)(ioasid_t id, void *data));
+
+#else /* CONFIG_IOASID_USER */
+static inline struct ioasid_user *
+ioasid_user_get_from_task(struct task_struct *task)
+{
+	return ERR_PTR(-ENOTTY);
+}
+
+static inline void ioasid_user_put(struct ioasid_user *iuser)
+{
+}
+
+static inline void ioasid_user_for_each_id(struct ioasid_user *iuser, void *data,
+					  void (*fn)(ioasid_t id, void *data))
+{
+}
+#endif /* CONFIG_IOASID_USER */
+
 #else /* !CONFIG_IOASID */
 
 static inline void ioasid_install_capacity(ioasid_t total)
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 0676f18093f9..9823901f11a4 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -21,6 +21,7 @@
 #define APOLLO_MOUSE_MINOR	7	/* unused */
 #define PC110PAD_MINOR		9	/* unused */
 /*#define ADB_MOUSE_MINOR	10	FIXME OBSOLETE */
+#define IOASID_MINOR		129     /* /dev/ioasid     */
 #define WATCHDOG_MINOR		130	/* Watchdog timer     */
 #define TEMP_MINOR		131	/* Temperature Sensor */
 #define APM_MINOR_DEV		134
diff --git a/include/uapi/linux/ioasid.h b/include/uapi/linux/ioasid.h
new file mode 100644
index 000000000000..1529070c0317
--- /dev/null
+++ b/include/uapi/linux/ioasid.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * PASID (Process Address Space ID) is a PCIe concept for tagging
+ * address spaces in DMA requests. When system-wide PASID allocation
+ * is required by the underlying iommu driver (e.g. Intel VT-d), this
+ * provides an interface for userspace to request ioasid alloc/free
+ * for its assigned devices.
+ *
+ * Copyright (C) 2021 Intel Corporation.  All rights reserved.
+ *     Author: Liu Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _UAPI_IOASID_H
+#define _UAPI_IOASID_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/ioasid.h>
+
+#define IOASID_API_VERSION	0
+
+
+/* Kernel & User level defines for IOASID IOCTLs. */
+
+#define IOASID_TYPE	('i')
+#define IOASID_BASE	100
+
+/* -------- IOCTLs for IOASID file descriptor (/dev/ioasid) -------- */
+
+/**
+ * IOASID_GET_API_VERSION - _IO(IOASID_TYPE, IOASID_BASE + 0)
+ *
+ * Report the version of the IOASID API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: IOASID_API_VERSION
+ * Availability: Always
+ */
+#define IOASID_GET_API_VERSION		_IO(IOASID_TYPE, IOASID_BASE + 0)
+
+/**
+ * IOASID_GET_INFO - _IOR(IOASID_TYPE, IOASID_BASE + 1, struct ioasid_info)
+ *
+ * Retrieve information about the IOASID object. Fills in provided
+ * struct ioasid_info. Caller sets argsz.
+ *
+ * @argsz:	 user filled size of this data.
+ * @flags:	 currently reserved for future extension. must set to 0.
+ * @ioasid_bits: maximum supported PASID bits, 0 represents no PASID
+ *		 support.
+ *
+ * Availability: Always
+ */
+struct ioasid_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	ioasid_bits;
+};
+#define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 1)
+
+/**
+ * IOASID_REQUEST_ALLOC - _IOWR(IOASID_TYPE, IOASID_BASE + 2,
+ *					struct ioasid_alloc_request)
+ *
+ * Alloc a PASID within @range. @range is [min, max], which means both
+ * @min and @max are inclusive.
+ * User space should provide min and max that fit within the ioasid_bits
+ * reported in ioasid_info via IOASID_GET_INFO.
+ *
+ * @argsz: user filled size of this data.
+ * @flags: currently reserved for future extension. must set to 0.
+ * @range: allocated ioasid is expected in the range.
+ *
+ * returns: allocated ID on success, -errno on failure
+ */
+struct ioasid_alloc_request {
+	__u32	argsz;
+	__u32	flags;
+	struct {
+		__u32	min;
+		__u32	max;
+	} range;
+};
+#define IOASID_REQUEST_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 2)
+
+/**
+ * IOASID_REQUEST_FREE - _IOWR(IOASID_TYPE, IOASID_BASE + 3, int)
+ *
+ * Free a PASID.
+ *
+ * returns: 0 on success, -errno on failure
+ */
+#define IOASID_REQUEST_FREE	_IO(IOASID_TYPE, IOASID_BASE + 3)
+
+#endif /* _UAPI_IOASID_H */
-- 
2.25.1
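
For readers following the uapi header above, here is a minimal userspace
sketch of how an allocation request could be prepared. It is not part of the
patch. The struct is a local mirror of the proposed struct
ioasid_alloc_request, and the helper names ioasid_range_valid() and
ioasid_fill_alloc_request() are hypothetical. The checks mirror the ones in
ioasid_alloc_request() on the kernel side: min <= max, and both bounds below
1 << ioasid_bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of struct ioasid_alloc_request from the proposed uapi header. */
struct ioasid_alloc_request {
	uint32_t argsz;
	uint32_t flags;
	struct {
		uint32_t min;
		uint32_t max;
	} range;
};

/*
 * Hypothetical helper: apply the same range checks as the kernel handler.
 * 'bits' is the value reported in ioasid_info.ioasid_bits (assumed <= 32).
 */
static bool ioasid_range_valid(uint32_t min, uint32_t max, unsigned int bits)
{
	uint64_t limit = 1ULL << bits;	/* first ID that does not fit */

	return min <= max && min < limit && max < limit;
}

/*
 * Hypothetical helper: fill a request for the inclusive range [min, max].
 * Returns 0 on success, -1 if the range would be rejected by the kernel.
 */
static int ioasid_fill_alloc_request(struct ioasid_alloc_request *req,
				     uint32_t min, uint32_t max,
				     unsigned int bits)
{
	if (!ioasid_range_valid(min, max, bits))
		return -1;

	req->argsz = sizeof(*req);	/* kernel requires argsz >= minsz */
	req->flags = 0;			/* reserved, must be zero */
	req->range.min = min;
	req->range.max = max;
	return 0;
}
```

A filled request would then be passed to ioctl() with IOASID_REQUEST_ALLOC on
a descriptor opened from /dev/ioasid. Per the header comment, the return value
is the allocated ID on success.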


^ permalink raw reply related	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 00/18] IOASID extensions for guest SVA
  2021-02-27 22:01 [PATCH V4 00/18] IOASID extensions for guest SVA Jacob Pan
                   ` (17 preceding siblings ...)
  2021-02-27 22:01 ` [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace Jacob Pan
@ 2021-03-02 12:58 ` Liu, Yi L
  18 siblings, 0 replies; 269+ messages in thread
From: Liu, Yi L @ 2021-03-02 12:58 UTC (permalink / raw)
  To: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj, Ashok, Tian, Kevin, Wu, Hao, Jiang, Dave

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Sunday, February 28, 2021 6:01 AM
>
> I/O Address Space ID (IOASID) core code was introduced in v5.5 as a generic
> kernel allocator service for both PCIe Process Address Space ID (PASID) and
> ARM SMMU's Substream ID. IOASIDs are used to associate DMA requests
> with
> virtual address spaces, including both host and guest.
> 
> In addition to providing basic ID allocation, ioasid_set was defined as a
> token that is shared by a group of IOASIDs. This set token can be used
> for permission checking, but lack some features to address the following
> needs by guest Shared Virtual Address (SVA).
> - Manage IOASIDs by group, group ownership, quota, etc.
> - State synchronization among IOASID users (e.g. IOMMU driver, KVM,
> device
> drivers)
> - Non-identity guest-host IOASID mapping
> - Lifecycle management
> 
> This patchset introduces the following extensions as solutions to the
> problems above.
> - Redefine and extend IOASID set such that IOASIDs can be managed by
> groups/pools.
> - Add notifications for IOASID state synchronization
> - Extend reference counting for life cycle alignment among multiple users
> - Support ioasid_set private IDs, which can be used as guest IOASIDs
> - Add a new cgroup controller for resource distribution
> 
> Please refer to Documentation/admin-guide/cgroup-v1/ioasids.rst and
> Documentation/driver-api/ioasid.rst in the enclosed patches for more
> details.
> 
> Based on discussions on LKML[1], a direction change was made in v4 such
> that
> the user interfaces for IOASID allocation are extracted from VFIO
> subsystem. The proposed IOASID subsystem now consists of three
> components:
> 1. IOASID core[01-14]: provides APIs for allocation, pool management,
>   notifications, and refcounting.
> 2. IOASID cgroup controller[RFC 15-17]: manage resource distribution[2].
> 3. IOASID user[RFC 18]:  provides user allocation interface via /dev/ioasid
> 
> This patchset only included VT-d driver as users of some of the new APIs.
> VFIO and KVM patches are coming up to fully utilize the APIs introduced
> here.
>
> [1] https://lore.kernel.org/linux-iommu/1599734733-6431-1-git-send-email-
> yi.l.liu@intel.com/
> [2] Note that ioasid quota management code can be removed once the
> IOASIDs
> cgroup is ratified.
> 
> You can find this series, VFIO, KVM, and IOASID user at:
> https://github.com/jacobpan/linux.git ioasid_v4
> (VFIO and KVM patches will be available at this branch when published.)

VFIO and QEMU series are listed below:

VFIO: https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l.liu@intel.com/
QEMU: https://lore.kernel.org/qemu-devel/20210302203827.437645-1-yi.l.liu@intel.com/T/#t

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-02-27 22:01 ` [RFC PATCH 15/18] cgroup: Introduce ioasids controller Jacob Pan
@ 2021-03-03 15:44   ` Tejun Heo
  2021-03-03 21:17     ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Tejun Heo @ 2021-03-03 15:44 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang

On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> IOASIDs are used to associate DMA requests with virtual address spaces.
> They are a system-wide limited resource made available to the userspace
> applications. Let it be VMs or user-space device drivers.
> 
> This RFC patch introduces a cgroup controller to address the following
> problems:
> 1. Some user applications exhaust all the available IOASIDs thus
> depriving others of the same host.
> 2. System admins need to provision VMs based on their needs for IOASIDs,
> e.g. the number of VMs with assigned devices that perform DMA requests
> with PASID.

Please take a look at the proposed misc controller:

 http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com

Would that fit your bill?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-03 15:44   ` Tejun Heo
@ 2021-03-03 21:17     ` Jacob Pan
  2021-03-04  0:02       ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-03 21:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang,
	jacob.jun.pan

Hi Tejun,

On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <tj@kernel.org> wrote:

> On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> > IOASIDs are used to associate DMA requests with virtual address spaces.
> > They are a system-wide limited resource made available to the userspace
> > applications. Let it be VMs or user-space device drivers.
> > 
> > This RFC patch introduces a cgroup controller to address the following
> > problems:
> > 1. Some user applications exhaust all the available IOASIDs thus
> > depriving others of the same host.
> > 2. System admins need to provision VMs based on their needs for IOASIDs,
> > e.g. the number of VMs with assigned devices that perform DMA requests
> > with PASID.  
> 
> Please take a look at the proposed misc controller:
> 
>  http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com
> 
> Would that fit your bill?
The interface definitely can be reused. But IOASID has a different behavior
in terms of migration and ownership checking. I guess SEV key IDs are not
tied to a process whereas IOASIDs are. Perhaps this can be solved by
adding
+	.can_attach	= ioasids_can_attach,
+	.cancel_attach	= ioasids_cancel_attach,
Let me give it a try and come back.
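
To make the migration concern concrete, here is a userspace sketch of the
charge/revert pattern that can_attach and cancel_attach callbacks would
implement. It is an illustration under assumptions, not the proposed
controller: struct ioasid_cg and the three helpers are hypothetical, and
'owned' stands for the number of IOASIDs held by the migrating process.

```c
#include <stdbool.h>

/* Hypothetical per-cgroup accounting, not the kernel's data structures. */
struct ioasid_cg {
	unsigned int limit;	/* maximum IOASIDs this group may hold */
	unsigned int usage;	/* IOASIDs currently charged, <= limit */
};

/*
 * can_attach step: charge the destination up front and refuse the
 * migration if the process's IOASIDs would not fit.
 */
static bool ioasid_cg_can_attach(struct ioasid_cg *dst, unsigned int owned)
{
	if (owned > dst->limit - dst->usage)
		return false;

	dst->usage += owned;
	return true;
}

/* cancel_attach step: undo the charge if the migration is aborted later. */
static void ioasid_cg_cancel_attach(struct ioasid_cg *dst, unsigned int owned)
{
	dst->usage -= owned;
}

/* Once the migration has gone through, release the source group's charge. */
static void ioasid_cg_attach_done(struct ioasid_cg *src, unsigned int owned)
{
	src->usage -= owned;
}
```

Charging the destination before releasing the source keeps the IDs accounted
for somewhere at every point during the move.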

Thanks for the pointer.

Jacob

> 
> Thanks.
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-03 21:17     ` Jacob Pan
@ 2021-03-04  0:02       ` Jacob Pan
  2021-03-04  0:23         ` Jason Gunthorpe
  2021-03-04  9:49         ` Jean-Philippe Brucker
  0 siblings, 2 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-04  0:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jason Gunthorpe, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jacob,

On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
<jacob.jun.pan@linux.intel.com> wrote:

> Hi Tejun,
> 
> On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <tj@kernel.org> wrote:
> 
> > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:  
> > > IOASIDs are used to associate DMA requests with virtual address
> > > spaces. They are a system-wide limited resource made available to the
> > > userspace applications. Let it be VMs or user-space device drivers.
> > > 
> > > This RFC patch introduces a cgroup controller to address the following
> > > problems:
> > > 1. Some user applications exhaust all the available IOASIDs thus
> > > depriving others of the same host.
> > > 2. System admins need to provision VMs based on their needs for
> > > IOASIDs, e.g. the number of VMs with assigned devices that perform
> > > DMA requests with PASID.    
> > 
> > Please take a look at the proposed misc controller:
> > 
> >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com
> > 
> > Would that fit your bill?  
> The interface definitely can be reused. But IOASID has a different
> behavior in terms of migration and ownership checking. I guess SEV key
> IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> solved by adding
> +	.can_attach	= ioasids_can_attach,
> +	.cancel_attach	= ioasids_cancel_attach,
> Let me give it a try and come back.
> 
While I am trying to fit the IOASIDs cgroup into the misc cgroup proposal,
I'd like to have a direction check on whether this idea of using cgroup for
IOASID/PASID resource management is viable.

Alex/Jason/Jean and everyone, your feedback is much appreciated.

> Thanks for the pointer.
> 
> Jacob
> 
> > 
> > Thanks.
> >   
> 
> 
> Thanks,
> 
> Jacob


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04  0:02       ` Jacob Pan
@ 2021-03-04  0:23         ` Jason Gunthorpe
  2021-03-04  9:49         ` Jean-Philippe Brucker
  1 sibling, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-04  0:23 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Tejun Heo, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> > The interface definitely can be reused. But IOASID has a different
> > behavior in terms of migration and ownership checking. I guess SEV key
> > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > solved by adding
> > +	.can_attach	= ioasids_can_attach,
> > +	.cancel_attach	= ioasids_cancel_attach,
> > Let me give it a try and come back.
> > 
> While I am trying to fit the IOASIDs cgroup in to the misc cgroup proposal.
> I'd like to have a direction check on whether this idea of using cgroup for
> IOASID/PASID resource management is viable.
> 
> Alex/Jason/Jean and everyone, your feedback is much appreciated.

IMHO I can't think of anything else to enforce some limit on a HW
scarce resource that unpriv userspace can consume.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04  0:02       ` Jacob Pan
  2021-03-04  0:23         ` Jason Gunthorpe
@ 2021-03-04  9:49         ` Jean-Philippe Brucker
  2021-03-04 17:46           ` Jacob Pan
  1 sibling, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-04  9:49 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Tejun Heo, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jason Gunthorpe, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang

On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> Hi Jacob,
> 
> On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
> <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Hi Tejun,
> > 
> > On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <tj@kernel.org> wrote:
> > 
> > > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:  
> > > > IOASIDs are used to associate DMA requests with virtual address
> > > > spaces. They are a system-wide limited resource made available to the
> > > > userspace applications. Let it be VMs or user-space device drivers.
> > > > 
> > > > This RFC patch introduces a cgroup controller to address the following
> > > > problems:
> > > > 1. Some user applications exhaust all the available IOASIDs thus
> > > > depriving others of the same host.
> > > > 2. System admins need to provision VMs based on their needs for
> > > > IOASIDs, e.g. the number of VMs with assigned devices that perform
> > > > DMA requests with PASID.    
> > > 
> > > Please take a look at the proposed misc controller:
> > > 
> > >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com
> > > 
> > > Would that fit your bill?  
> > The interface definitely can be reused. But IOASID has a different
> > behavior in terms of migration and ownership checking. I guess SEV key
> > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > solved by adding
> > +	.can_attach	= ioasids_can_attach,
> > +	.cancel_attach	= ioasids_cancel_attach,
> > Let me give it a try and come back.
> > 
> While I am trying to fit the IOASIDs cgroup in to the misc cgroup proposal.
> I'd like to have a direction check on whether this idea of using cgroup for
> IOASID/PASID resource management is viable.

Yes, even for host SVA it would be good to have a cgroup. Currently the
number of shared address spaces is naturally limited by number of
processes, which can be controlled with rlimit and cgroup. But on Arm the
hardware limit on shared address spaces is 64k (number of ASIDs), easily
exhausted with the default PASID and PID limits. So a cgroup for managing
this resource is more than welcome.

It looks like your current implementation is very dependent on
IOASID_SET_TYPE_MM?  I'll need to do more reading about cgroup to see how
easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04  9:49         ` Jean-Philippe Brucker
@ 2021-03-04 17:46           ` Jacob Pan
  2021-03-04 17:54             ` Jason Gunthorpe
  2021-03-05  8:30             ` Jean-Philippe Brucker
  0 siblings, 2 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-04 17:46 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tejun Heo, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jason Gunthorpe, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jean-Philippe,

On Thu, 4 Mar 2021 10:49:37 +0100, Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:

> On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> > Hi Jacob,
> > 
> > On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
> > <jacob.jun.pan@linux.intel.com> wrote:
> >   
> > > Hi Tejun,
> > > 
> > > On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <tj@kernel.org> wrote:
> > >   
> > > > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:    
> > > > > IOASIDs are used to associate DMA requests with virtual address
> > > > > spaces. They are a system-wide limited resource made available to
> > > > > the userspace applications. Let it be VMs or user-space device
> > > > > drivers.
> > > > > 
> > > > > This RFC patch introduces a cgroup controller to address the
> > > > > following problems:
> > > > > 1. Some user applications exhaust all the available IOASIDs thus
> > > > > depriving others of the same host.
> > > > > 2. System admins need to provision VMs based on their needs for
> > > > > IOASIDs, e.g. the number of VMs with assigned devices that perform
> > > > > DMA requests with PASID.      
> > > > 
> > > > Please take a look at the proposed misc controller:
> > > > 
> > > >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com
> > > > 
> > > > Would that fit your bill?    
> > > The interface definitely can be reused. But IOASID has a different
> > > behavior in terms of migration and ownership checking. I guess SEV key
> > > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > > solved by adding
> > > +	.can_attach	= ioasids_can_attach,
> > > +	.cancel_attach	= ioasids_cancel_attach,
> > > Let me give it a try and come back.
> > >   
> > While I am trying to fit the IOASIDs cgroup in to the misc cgroup
> > proposal. I'd like to have a direction check on whether this idea of
> > using cgroup for IOASID/PASID resource management is viable.  
> 
> Yes, even for host SVA it would be good to have a cgroup. Currently the
> number of shared address spaces is naturally limited by number of
> processes, which can be controlled with rlimit and cgroup. But on Arm the
> hardware limit on shared address spaces is 64k (number of ASIDs), easily
> exhausted with the default PASID and PID limits. So a cgroup for managing
> this resource is more than welcome.
> 
> It looks like your current implementation is very dependent on
> IOASID_SET_TYPE_MM?  I'll need to do more reading about cgroup to see how
> easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.
> 
Right, I was assuming we have three use cases of IOASIDs:
1. host supervisor SVA (not a concern, just one init_mm to bind)
2. host user SVA, either one IOASID per process or perhaps some private
IOASID for private address space
3. VM use for guest SVA, each IOASID is bound to a guest process

My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
allocated by the new /dev/ioasid interface.

For #2, I was thinking you can limit the host process via PIDs cgroup, i.e.
limit fork. The host IOASIDs are currently allocated from the system pool
with a quota chosen by iommu_sva_init() in my patch; 0 means unlimited, i.e.
use whatever is available. https://lkml.org/lkml/2021/2/28/18


> Thanks,
> Jean


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04 17:46           ` Jacob Pan
@ 2021-03-04 17:54             ` Jason Gunthorpe
  2021-03-04 19:01               ` Jacob Pan
  2021-03-05  8:30             ` Jean-Philippe Brucker
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-04 17:54 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, Tejun Heo, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:

> Right, I was assuming have three use cases of IOASIDs:
> 1. host supervisor SVA (not a concern, just one init_mm to bind)
> 2. host user SVA, either one IOASID per process or perhaps some private
> IOASID for private address space
> 3. VM use for guest SVA, each IOASID is bound to a guest process
> 
> My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
> allocated by the new /dev/ioasid interface.
> 
> For #2, I was thinking you can limit the host process via PIDs cgroup? i.e.
> limit fork. So the host IOASIDs are currently allocated from the system pool
> with quota of chosen by iommu_sva_init() in my patch, 0 means unlimited use
> whatever is available. https://lkml.org/lkml/2021/2/28/18

Why do we need two pools?

If PASID's are limited then why does it matter how the PASID was
allocated? Either the thing requesting it is below the limit, or it
isn't.

For something like qemu I'd expect to put the qemu process in a cgroup
with 1 PASID. Who cares what qemu uses the PASID for, or how it was
allocated?

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04 17:54             ` Jason Gunthorpe
@ 2021-03-04 19:01               ` Jacob Pan
  2021-03-04 19:02                 ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-04 19:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tejun Heo, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang, jacob.jun.pan

Hi Jason,

On Thu, 4 Mar 2021 13:54:02 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> 
> > Right, I was assuming have three use cases of IOASIDs:
> > 1. host supervisor SVA (not a concern, just one init_mm to bind)
> > 2. host user SVA, either one IOASID per process or perhaps some private
> > IOASID for private address space
> > 3. VM use for guest SVA, each IOASID is bound to a guest process
> > 
> > My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which
> > is allocated by the new /dev/ioasid interface.
> > 
> > For #2, I was thinking you can limit the host process via PIDs cgroup?
> > i.e. limit fork. So the host IOASIDs are currently allocated from the
> > system pool with quota of chosen by iommu_sva_init() in my patch, 0
> > means unlimited use whatever is available.
> > https://lkml.org/lkml/2021/2/28/18  
> 
> Why do we need two pools?
> 
> If PASID's are limited then why does it matter how the PASID was
> allocated? Either the thing requesting it is below the limit, or it
> isn't.
> 
You are right. It should be tracked based on the process, regardless of
whether it is allocated by the user (/dev/ioasid) or indirectly by kernel
drivers during iommu_sva_bind_device(). We need to consolidate #2 and #3 and
decouple cgroup and IOASID set.

> For something like qemu I'd expect to put the qemu process in a cgroup
> with 1 PASID. Who cares what qemu uses the PASID for, or how it was
> allocated?
> 
For vSVA, we will need one PASID per guest process. But that is up to the
admin based on whether or how many SVA capable devices are directly
assigned.

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04 19:01               ` Jacob Pan
@ 2021-03-04 19:02                 ` Jason Gunthorpe
  2021-03-04 21:28                   ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-04 19:02 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, Tejun Heo, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Thu, Mar 04, 2021 at 11:01:44AM -0800, Jacob Pan wrote:

> > For something like qemu I'd expect to put the qemu process in a cgroup
> > with 1 PASID. Who cares what qemu uses the PASID for, or how it was
> > allocated?
> 
> For vSVA, we will need one PASID per guest process. But that is up to the
> admin based on whether or how many SVA capable devices are directly
> assigned.

I hope the virtual IOMMU driver can communicate the PASID limit and
the cgroup machinery in the guest can know what the actual limit is.

I was thinking of a case where qemu is using a single PASID to set up
the guest kVA or similar.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04 19:02                 ` Jason Gunthorpe
@ 2021-03-04 21:28                   ` Jacob Pan
  0 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-04 21:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tejun Heo, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang, jacob.jun.pan

Hi Jason,

On Thu, 4 Mar 2021 15:02:53 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 04, 2021 at 11:01:44AM -0800, Jacob Pan wrote:
> 
> > > For something like qemu I'd expect to put the qemu process in a cgroup
> > > with 1 PASID. Who cares what qemu uses the PASID for, or how it was
> > > allocated?  
> > 
> > For vSVA, we will need one PASID per guest process. But that is up to
> > the admin based on whether or how many SVA capable devices are directly
> > assigned.  
> 
> I hope the virtual IOMMU driver can communicate the PASID limit and
> the cgroup machinery in the guest can know what the actual limit is.
> 
For VT-d, emulated vIOMMU can communicate with the guest IOMMU driver on how
many PASID bits are supported (extended cap reg PASID size fields). But it
cannot communicate how many PASIDs are in the pool (host cgroup capacity).

The QEMU process may not be the only one in a cgroup so it cannot give hard
guarantees. I don't see a good way to communicate accurately at runtime as
the process migrates or limit changes.

We were thinking of adopting the "Limits" model as defined in the cgroup-v2
doc.
"
Limits
------

A child can only consume upto the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.
"

So the guest cgroup would still think it has the full 20 bits of PASID at
its disposal, but PASID allocation may fail before reaching the full 20
bits (1M).
Similarly, on the host side, we only enforce the limit set by the cgroup
but do not guarantee it.
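
As a rough illustration of that model (a userspace sketch, not code from
the patch; struct pasid_cg, pasid_try_charge() and pasid_uncharge() are
hypothetical names), charging walks up the hierarchy, so an allocation can
fail at the parent even while the child is still under its own limit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one cgroup node; not a kernel structure. */
struct pasid_cg {
	struct pasid_cg *parent;	/* NULL for the root */
	unsigned long limit;		/* max PASIDs this subtree may hold */
	unsigned long usage;		/* PASIDs currently charged here */
};

/*
 * Charge one PASID to @cg and to every ancestor. Fails if any level is
 * already at its limit, leaving all usage counters unchanged.
 */
static bool pasid_try_charge(struct pasid_cg *cg)
{
	struct pasid_cg *c;

	for (c = cg; c; c = c->parent)
		if (c->usage >= c->limit)
			return false;

	for (c = cg; c; c = c->parent)
		c->usage++;
	return true;
}

/* Undo one successful pasid_try_charge() on @cg. */
static void pasid_uncharge(struct pasid_cg *cg)
{
	for (; cg; cg = cg->parent)
		cg->usage--;
}
```

With a parent limited to 4 and two children limited to 3 each, the
children's limits add up to 6, so a child can see a failure after a single
allocation if its sibling got there first. The limit is enforced but
nothing is reserved.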

> I was thinking of a case where qemu is using a single PASID to setup
> the guest kVA or similar
> 
got it.

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-04 17:46           ` Jacob Pan
  2021-03-04 17:54             ` Jason Gunthorpe
@ 2021-03-05  8:30             ` Jean-Philippe Brucker
  2021-03-05 17:16               ` Jean-Philippe Brucker
  2021-03-05 18:20               ` Jacob Pan
  1 sibling, 2 replies; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-05  8:30 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Tejun Heo, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jason Gunthorpe, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang

On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> Hi Jean-Philippe,
> 
> On Thu, 4 Mar 2021 10:49:37 +0100, Jean-Philippe Brucker
> <jean-philippe@linaro.org> wrote:
> 
> > On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> > > Hi Jacob,
> > > 
> > > On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
> > > <jacob.jun.pan@linux.intel.com> wrote:
> > >   
> > > > Hi Tejun,
> > > > 
> > > > On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <tj@kernel.org> wrote:
> > > >   
> > > > > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:    
> > > > > > IOASIDs are used to associate DMA requests with virtual address
> > > > > > spaces. They are a system-wide limited resource made available to
> > > > > > the userspace applications. Let it be VMs or user-space device
> > > > > > drivers.
> > > > > > 
> > > > > > This RFC patch introduces a cgroup controller to address the
> > > > > > following problems:
> > > > > > 1. Some user applications exhaust all the available IOASIDs thus
> > > > > > depriving others of the same host.
> > > > > > 2. System admins need to provision VMs based on their needs for
> > > > > > IOASIDs, e.g. the number of VMs with assigned devices that perform
> > > > > > DMA requests with PASID.      
> > > > > 
> > > > > Please take a look at the proposed misc controller:
> > > > > 
> > > > >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com
> > > > > 
> > > > > Would that fit your bill?    
> > > > The interface definitely can be reused. But IOASID has a different
> > > > behavior in terms of migration and ownership checking. I guess SEV key
> > > > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > > > solved by adding
> > > > +	.can_attach	= ioasids_can_attach,
> > > > +	.cancel_attach	= ioasids_cancel_attach,
> > > > Let me give it a try and come back.
> > > >   
> > > While I am trying to fit the IOASIDs cgroup into the misc cgroup
> > > proposal, I'd like to have a direction check on whether this idea of
> > > using cgroup for IOASID/PASID resource management is viable.
> > 
> > Yes, even for host SVA it would be good to have a cgroup. Currently the
> > number of shared address spaces is naturally limited by number of
> > processes, which can be controlled with rlimit and cgroup. But on Arm the
> > hardware limit on shared address spaces is 64k (number of ASIDs), easily
> > exhausted with the default PASID and PID limits. So a cgroup for managing
> > this resource is more than welcome.
> > 
> > It looks like your current implementation is very dependent on
> > IOASID_SET_TYPE_MM?  I'll need to do more reading about cgroup to see how
> > easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.
> > 
> Right, I was assuming we have three use cases of IOASIDs:
> 1. host supervisor SVA (not a concern, just one init_mm to bind)
> 2. host user SVA, either one IOASID per process or perhaps some private
> IOASID for private address space
> 3. VM use for guest SVA, each IOASID is bound to a guest process
> 
> My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
> allocated by the new /dev/ioasid interface.
> 
> For #2, I was thinking you can limit the host process via PIDs cgroup? i.e.
> limit fork.

That works but isn't perfect, because the hardware resource of shared
address spaces can be much lower than the PID limit - 16k ASIDs on Arm. To
allow an admin to fairly distribute that resource we could introduce
another cgroup just to limit the number of shared address spaces, but
limiting the number of IOASIDs does the trick.

> So the host IOASIDs are currently allocated from the system pool
> with a quota chosen by iommu_sva_init() in my patch; 0 means unlimited,
> i.e. use whatever is available. https://lkml.org/lkml/2021/2/28/18

Yes that's sensible, but it would be good to plan the cgroup user
interface to work for #2 as well, even if we don't implement it right
away.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-05  8:30             ` Jean-Philippe Brucker
@ 2021-03-05 17:16               ` Jean-Philippe Brucker
  2021-03-05 18:20               ` Jacob Pan
  1 sibling, 0 replies; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-05 17:16 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Tejun Heo, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jason Gunthorpe, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang

On Fri, Mar 05, 2021 at 09:30:49AM +0100, Jean-Philippe Brucker wrote:
> That works but isn't perfect, because the hardware resource of shared
> address spaces can be much lower than the PID limit - 16k ASIDs on Arm. To

Sorry I meant 16-bit here - 64k

Thanks,
Jean

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller
  2021-03-05  8:30             ` Jean-Philippe Brucker
  2021-03-05 17:16               ` Jean-Philippe Brucker
@ 2021-03-05 18:20               ` Jacob Pan
  1 sibling, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-05 18:20 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tejun Heo, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jason Gunthorpe, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jean-Philippe,

On Fri, 5 Mar 2021 09:30:49 +0100, Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:

> On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> > Hi Jean-Philippe,
> > 
> > On Thu, 4 Mar 2021 10:49:37 +0100, Jean-Philippe Brucker
> > <jean-philippe@linaro.org> wrote:
> >   
> > > On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:  
> > > > Hi Jacob,
> > > > 
> > > > On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
> > > > <jacob.jun.pan@linux.intel.com> wrote:
> > > >     
> > > > > Hi Tejun,
> > > > > 
> > > > > On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <tj@kernel.org>
> > > > > wrote: 
> > > > > > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:      
> > > > > > > IOASIDs are used to associate DMA requests with virtual
> > > > > > > address spaces. They are a system-wide limited resource made
> > > > > > > available to the userspace applications. Let it be VMs or
> > > > > > > user-space device drivers.
> > > > > > > 
> > > > > > > This RFC patch introduces a cgroup controller to address the
> > > > > > > following problems:
> > > > > > > 1. Some user applications exhaust all the available IOASIDs
> > > > > > > thus depriving others of the same host.
> > > > > > > 2. System admins need to provision VMs based on their needs
> > > > > > > for IOASIDs, e.g. the number of VMs with assigned devices
> > > > > > > that perform DMA requests with PASID.        
> > > > > > 
> > > > > > Please take a look at the proposed misc controller:
> > > > > > 
> > > > > >  http://lkml.kernel.org/r/20210302081705.1990283-2-vipinsh@google.com
> > > > > > 
> > > > > > Would that fit your bill?      
> > > > > The interface definitely can be reused. But IOASID has a different
> > > > > behavior in terms of migration and ownership checking. I guess
> > > > > SEV key IDs are not tied to a process whereas IOASIDs are.
> > > > > Perhaps this can be solved by adding
> > > > > +	.can_attach	= ioasids_can_attach,
> > > > > +	.cancel_attach	= ioasids_cancel_attach,
> > > > > Let me give it a try and come back.
> > > > >     
> > > > While I am trying to fit the IOASIDs cgroup into the misc cgroup
> > > > proposal, I'd like to have a direction check on whether this idea of
> > > > using cgroup for IOASID/PASID resource management is viable.
> > > 
> > > Yes, even for host SVA it would be good to have a cgroup. Currently
> > > the number of shared address spaces is naturally limited by number of
> > > processes, which can be controlled with rlimit and cgroup. But on Arm
> > > the hardware limit on shared address spaces is 64k (number of ASIDs),
> > > easily exhausted with the default PASID and PID limits. So a cgroup
> > > for managing this resource is more than welcome.
> > > 
> > > It looks like your current implementation is very dependent on
> > > IOASID_SET_TYPE_MM?  I'll need to do more reading about cgroup to see
> > > how easily it can be adapted to host SVA which uses
> > > IOASID_SET_TYPE_NULL. 
> > Right, I was assuming we have three use cases of IOASIDs:
> > 1. host supervisor SVA (not a concern, just one init_mm to bind)
> > 2. host user SVA, either one IOASID per process or perhaps some private
> > IOASID for private address space
> > 3. VM use for guest SVA, each IOASID is bound to a guest process
> > 
> > My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which
> > is allocated by the new /dev/ioasid interface.
> > 
> > For #2, I was thinking you can limit the host process via PIDs cgroup?
> > i.e. limit fork.  
> 
> That works but isn't perfect, because the hardware resource of shared
> address spaces can be much lower than the PID limit - 16k ASIDs on Arm. To
> allow an admin to fairly distribute that resource we could introduce
> another cgroup just to limit the number of shared address spaces, but
> limiting the number of IOASIDs does the trick.
> 
Makes sense. It would be cleaner to have a single approach to limit IOASIDs
(as Jason asked).

> > So the host IOASIDs are currently allocated from the system pool
> > with a quota chosen by iommu_sva_init() in my patch; 0 means unlimited,
> > i.e. use whatever is available. https://lkml.org/lkml/2021/2/28/18
> 
> Yes that's sensible, but it would be good to plan the cgroup user
> interface to work for #2 as well, even if we don't implement it right
> away.
> 
will do it in the next version.

> Thanks,
> Jean


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace
  2021-02-27 22:01 ` [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace Jacob Pan
@ 2021-03-10 19:23   ` Jason Gunthorpe
  2021-03-11 22:55     ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-10 19:23 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang

On Sat, Feb 27, 2021 at 02:01:26PM -0800, Jacob Pan wrote:

> +/* -------- IOCTLs for IOASID file descriptor (/dev/ioasid) -------- */
> +
> +/**
> + * IOASID_GET_API_VERSION - _IO(IOASID_TYPE, IOASID_BASE + 0)
> + *
> + * Report the version of the IOASID API.  This allows us to bump the entire
> + * API version should we later need to add or change features in incompatible
> + * ways.
> + * Return: IOASID_API_VERSION
> + * Availability: Always
> + */
> +#define IOASID_GET_API_VERSION		_IO(IOASID_TYPE, IOASID_BASE + 0)

I think this is generally a bad idea, if you change the API later then
also change the ioctl numbers and everything should work out

eg use the 4th argument to IOC to specify something about the ABI

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace
  2021-03-10 19:23   ` Jason Gunthorpe
@ 2021-03-11 22:55     ` Jacob Pan
  2021-03-12 14:54       ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-11 22:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

Thanks for the review.

On Wed, 10 Mar 2021 15:23:01 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Sat, Feb 27, 2021 at 02:01:26PM -0800, Jacob Pan wrote:
> 
> > +/* -------- IOCTLs for IOASID file descriptor (/dev/ioasid) -------- */
> > +
> > +/**
> > + * IOASID_GET_API_VERSION - _IO(IOASID_TYPE, IOASID_BASE + 0)
> > + *
> > + * Report the version of the IOASID API.  This allows us to bump the entire
> > + * API version should we later need to add or change features in incompatible
> > + * ways.
> > + * Return: IOASID_API_VERSION
> > + * Availability: Always
> > + */
> > +#define IOASID_GET_API_VERSION		_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
> I think this is generally a bad idea, if you change the API later then
> also change the ioctl numbers and everything should work out
> 
> eg use the 4th argument to IOC to specify something about the ABI
> 
Let me try to understand the idea, do you mean something like this?
#define IOASID_GET_INFO _IOC(_IOC_NONE, IOASID_TYPE, IOASID_BASE + 1,
sizeof(struct ioasid_info))

If we later change the size of struct ioasid_info, IOASID_GET_INFO would be
a different ioctl number. Then we will break the existing user space that
uses the old number. So I am guessing you meant we need to have a different
name also. i.e.

#define IOASID_GET_INFO_V2 _IOC(_IOC_NONE, IOASID_TYPE, IOASID_BASE + 1,
sizeof(struct ioasid_info_v2))

We can get rid of the API version, just have individual IOCTL version.
Is that right?

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [RFC PATCH 18/18] ioasid: Add /dev/ioasid for userspace
  2021-03-11 22:55     ` Jacob Pan
@ 2021-03-12 14:54       ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-12 14:54 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj Ashok, Tian,
	Kevin, Yi Liu, Wu Hao, Dave Jiang

On Thu, Mar 11, 2021 at 02:55:34PM -0800, Jacob Pan wrote:
> Hi Jason,
> 
> Thanks for the review.
> 
> On Wed, 10 Mar 2021 15:23:01 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sat, Feb 27, 2021 at 02:01:26PM -0800, Jacob Pan wrote:
> > 
> > > +/* -------- IOCTLs for IOASID file descriptor (/dev/ioasid) -------- */
> > > +
> > > +/**
> > > + * IOASID_GET_API_VERSION - _IO(IOASID_TYPE, IOASID_BASE + 0)
> > > + *
> > > + * Report the version of the IOASID API.  This allows us to bump the entire
> > > + * API version should we later need to add or change features in incompatible
> > > + * ways.
> > > + * Return: IOASID_API_VERSION
> > > + * Availability: Always
> > > + */
> > > +#define IOASID_GET_API_VERSION		_IO(IOASID_TYPE, IOASID_BASE + 0)
> > 
> > I think this is generally a bad idea, if you change the API later then
> > also change the ioctl numbers and everything should work out
> > 
> > eg use the 4th argument to IOC to specify something about the ABI
> > 
> Let me try to understand the idea, do you mean something like this?
> #define IOASID_GET_INFO _IOC(_IOC_NONE, IOASID_TYPE, IOASID_BASE + 1,
> sizeof(struct ioasid_info))
> 
> If we later change the size of struct ioasid_info, IOASID_GET_INFO would be
> a different ioctl number. Then we will break the existing user space that
> uses the old number. So I am guessing you meant we need to have a different
> name also. i.e.

Something like that is more appropriate. Generally we should not be
planning to 'remove' IOCTLs. The kernel must always have backwards
compat, so any new format you introduce down the road has to have a new
IOCTL number so the old format can continue to be supported.

Negotiation of support can usually be done by probing for ENOIOCTLCMD
or similar on the new ioctls, not an API version
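
A rough sketch of what that could look like from userspace. The struct
layouts, the 'i' magic, the EX_* command names and ex_ioasid_get_info()
are all made up for illustration and are not taken from the patch. The
size of the argument struct is encoded in the ioctl number, so a new
format gets a new number even when the command index stays the same, and
userspace probes by trying the new command first:

```c
#include <assert.h>
#include <errno.h>
#include <linux/ioctl.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical argument layouts; not taken from the patch. */
struct ioasid_info_v1 {
	uint32_t argsz;
	uint32_t flags;
};

struct ioasid_info_v2 {
	uint32_t argsz;
	uint32_t flags;
	uint64_t capacity;	/* field added by the newer format */
};

/* Hypothetical magic and command numbers, for illustration only. */
#define EX_IOASID_TYPE		'i'
#define EX_IOASID_GET_INFO	_IOR(EX_IOASID_TYPE, 1, struct ioasid_info_v1)
#define EX_IOASID_GET_INFO_V2	_IOR(EX_IOASID_TYPE, 1, struct ioasid_info_v2)

/*
 * Prefer the v2 format and fall back to v1 when the kernel rejects the
 * unknown command. Unknown ioctls normally reach userspace as ENOTTY.
 * Returns 2 or 1 for the format that was filled in, or -1 with errno set.
 */
static int ex_ioasid_get_info(int fd, struct ioasid_info_v2 *info)
{
	struct ioasid_info_v1 v1 = { .argsz = sizeof(v1) };

	info->argsz = sizeof(*info);
	if (ioctl(fd, EX_IOASID_GET_INFO_V2, info) == 0)
		return 2;
	if (errno != ENOTTY)
		return -1;

	if (ioctl(fd, EX_IOASID_GET_INFO, &v1) != 0)
		return -1;
	info->flags = v1.flags;
	info->capacity = 0;	/* not reported by the old format */
	return 1;
}
```

The old command keeps working unchanged for existing binaries, and no
separate version query is needed.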

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-02-27 22:01 ` [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs Jacob Pan
@ 2021-03-19  0:22   ` Jacob Pan
  2021-03-19  9:58     ` Jean-Philippe Brucker
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-19  0:22 UTC (permalink / raw)
  To: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker
  Cc: Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang,
	jacob.jun.pan

Hi Jean,

Slightly off the title. As we are moving to use cgroup to limit PASID
allocations, it would be much simpler if we enforce on the current task.

However, iommu_sva_alloc_pasid() takes an mm_struct pointer as argument
which implies it can be something other than the current task mm. So far all
kernel callers use current task mm. Is there a use case for doing PASID
allocation on behalf of another mm? If not, can we remove the mm argument?

Thanks,

Jacob

>  /**
>   * iommu_sva_alloc_pasid - Allocate a PASID for the mm
> @@ -35,11 +44,11 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
>  	mutex_lock(&iommu_sva_lock);
>  	if (mm->pasid) {
>  		if (mm->pasid >= min && mm->pasid <= max)
> -			ioasid_get(mm->pasid);
> +			ioasid_get(iommu_sva_pasid, mm->pasid);
>  		else
>  			ret = -EOVERFLOW;
>  	} else {
> -		pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm);
> +		pasid = ioasid_alloc(iommu_sva_pasid, min, max, mm);
>  		if (pasid == INVALID_IOASID)
>  			ret = -ENOMEM;

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19  0:22   ` Jacob Pan
@ 2021-03-19  9:58     ` Jean-Philippe Brucker
  2021-03-19 12:46       ` Jason Gunthorpe
  2021-03-19 17:14       ` Jacob Pan
  0 siblings, 2 replies; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-19  9:58 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang

Hi Jacob,

On Thu, Mar 18, 2021 at 05:22:34PM -0700, Jacob Pan wrote:
> Hi Jean,
> 
> Slightly off the title. As we are moving to use cgroup to limit PASID
> allocations, it would be much simpler if we enforce on the current task.

Yes I think we should do that. Is there a problem with charging the
process that does the PASID allocation even if the PASID indexes some
other mm?

> However, iommu_sva_alloc_pasid() takes an mm_struct pointer as argument
> which implies it can be something other than the current task mm. So far all
> kernel callers use current task mm. Is there a use case for doing PASID
> allocation on behalf of another mm? If not, can we remove the mm argument?

This would effectively remove the mm parameter from
iommu_sva_bind_device(). I'm not opposed to that, but reintroducing it
later will be difficult if IOMMU drivers start assuming that the bound mm
is from current.

Although there is no use for it at the moment (only two upstream users and
it looks like amdkfd always uses current too), I quite like the
client-server model where the privileged process does bind() and programs
the hardware queue on behalf of the client process.

Thanks,
Jean


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19  9:58     ` Jean-Philippe Brucker
@ 2021-03-19 12:46       ` Jason Gunthorpe
  2021-03-19 13:41         ` Jean-Philippe Brucker
  2021-03-19 17:14       ` Jacob Pan
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 12:46 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:

> Although there is no use for it at the moment (only two upstream users and
> it looks like amdkfd always uses current too), I quite like the
> client-server model where the privileged process does bind() and programs
> the hardware queue on behalf of the client process.

This creates a lot of complexity: how does process A get a secure
reference to B? How does it access the memory in B to set up the HW?

Why do we need separation anyhow? SVM devices are supposed to be
secure or they shouldn't do SVM.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19 12:46       ` Jason Gunthorpe
@ 2021-03-19 13:41         ` Jean-Philippe Brucker
  2021-03-19 13:54           ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-19 13:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> 
> > Although there is no use for it at the moment (only two upstream users and
> > it looks like amdkfd always uses current too), I quite like the
> > client-server model where the privileged process does bind() and programs
> > the hardware queue on behalf of the client process.
> 
> This creates a lot of complexity: how does process A get a secure
> reference to B? How does it access the memory in B to set up the HW?

mm_access() for example, and passing addresses via IPC

> Why do we need separation anyhow? SVM devices are supposed to be
> secure or they shouldn't do SVM.

Right

Thanks,
Jean

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19 13:41         ` Jean-Philippe Brucker
@ 2021-03-19 13:54           ` Jason Gunthorpe
  2021-03-19 18:22             ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 13:54 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > 
> > > Although there is no use for it at the moment (only two upstream users and
> > > it looks like amdkfd always uses current too), I quite like the
> > > client-server model where the privileged process does bind() and programs
> > > the hardware queue on behalf of the client process.
> > 
> > This creates a lot of complexity: how does process A get a secure
> > reference to B? How does it access the memory in B to set up the HW?
> 
> mm_access() for example, and passing addresses via IPC

I'd rather the source process establish its own PASID and then pass
the rights to use it to some other process via FD passing than try to
go the other way. There are lots of security questions with something
like mm_access.
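
For reference, the FD passing mechanism itself is plain SCM_RIGHTS over a
Unix socket. The sketch below is generic userspace code, not part of the
patch; send_fd() and recv_fd() are hypothetical helper names, and a file
descriptor that stands for one PASID is an assumption here:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send @fd over the connected Unix socket @sock. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char data = 'F';	/* stream sockets need one byte of real data */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from @sock. Returns the new descriptor or -1. */
static int recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor for the same open file, so
the owner decides what to hand out and the other side never needs access
to the owner's mm.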

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19  9:58     ` Jean-Philippe Brucker
  2021-03-19 12:46       ` Jason Gunthorpe
@ 2021-03-19 17:14       ` Jacob Pan
  1 sibling, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-19 17:14 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jason Gunthorpe, Jonathan Corbet,
	Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao, Dave Jiang,
	jacob.jun.pan

Hi Jean-Philippe,

On Fri, 19 Mar 2021 10:58:41 +0100, Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:

> > Slightly off the title. As we are moving to use cgroup to limit PASID
> > allocations, it would be much simpler if we enforce on the current
> > task.  
> 
> Yes I think we should do that. Is there a problem with charging the
> process that does the PASID allocation even if the PASID indexes some
> other mm?
Besides complexity, my second concern is that we are sharing the misc
cgroup controller with other resources that do not have such behavior.

Cgroup v2 also has unified hierarchy which also requires coherent behavior
among controllers.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19 13:54           ` Jason Gunthorpe
@ 2021-03-19 18:22             ` Jacob Pan
  2021-03-22  9:24               ` Jean-Philippe Brucker
  2021-03-22 12:03               ` Jason Gunthorpe
  0 siblings, 2 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-19 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > >   
> > > > Although there is no use for it at the moment (only two upstream
> > > > users and it looks like amdkfd always uses current too), I quite
> > > > like the client-server model where the privileged process does
> > > > bind() and programs the hardware queue on behalf of the client
> > > > process.  
> > > 
> > > > This creates a lot of complexity: how does process A get a secure
> > > > reference to B? How does it access the memory in B to set up the HW?
> > 
> > mm_access() for example, and passing addresses via IPC  
> 
> I'd rather the source process establish its own PASID and then pass
> the rights to use it to some other process via FD passing than try to
> go the other way. There are lots of security questions with something
> like mm_access.
> 

Thank you all for the input. It sounds like we are OK to remove the mm
argument from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?

Let me try to summarize PASID allocation as below:

Interfaces     | Usage         | Limit  | bind¹ | User visible
--------------------------------------------------------------
/dev/ioasid²   | G-SVA/IOVA    | cgroup | No    | Yes
--------------------------------------------------------------
char dev³      | SVA           | cgroup | Yes   | No
--------------------------------------------------------------
iommu driver   | default PASID | no     | No    | No
--------------------------------------------------------------
kernel         | super SVA     | no     | Yes   | No
--------------------------------------------------------------

¹ Allocated during SVA bind
² PASIDs allocated via /dev/ioasid are not bound to any mm, but their
  ownership is assigned to the process that does the allocation.
³ Includes uacce and other private device driver char devs such as idxd

Currently, the proposed /dev/ioasid interface does not map an individual
PASID to an FD. The FD is at the ioasid_set granularity and bound to the
current mm. We could extend the IOCTLs to cover the individual PASID-FD
passing case when use cases arise. Would this work?

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19 18:22             ` Jacob Pan
@ 2021-03-22  9:24               ` Jean-Philippe Brucker
  2021-03-24 17:02                 ` Jacob Pan
  2021-03-22 12:03               ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-22  9:24 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > > >   
> > > > > Although there is no use for it at the moment (only two upstream
> > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > like the client-server model where the privileged process does
> > > > > bind() and programs the hardware queue on behalf of the client
> > > > > process.  
> > > > 
> > > > > This creates a lot of complexity: how does process A get a secure
> > > > > reference to B? How does it access the memory in B to set up the HW?
> > > 
> > > mm_access() for example, and passing addresses via IPC  
> > 
> > I'd rather the source process establish its own PASID and then pass
> > the rights to use it to some other process via FD passing than try to
> > go the other way. There are lots of security questions with something
> > like mm_access.
> > 
> 
> Thank you all for the input, it sounds like we are OK to remove mm argument
> from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?

Fine by me. By the way, the IDXD currently misuses the bind API for
supervisor PASID, and the drvdata parameter isn't otherwise used. This
would be a good occasion to clean up both. The new bind prototype could be:

struct iommu_sva *iommu_sva_bind_device(struct device *dev, int flags)

And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
the SMMU, but I think we need to clean the current usage)
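
A minimal userspace sketch of what a flags-based bind could look like, to
make the shape of the proposal concrete. Only the prototype shape and the
IOMMU_SVA_BIND_SUPERVISOR name come from the text above; the flag value,
the stand-in structs, the supports_supervisor field and the function name
sva_bind_sketch() are assumptions for illustration. The in-kernel API
would report failures differently (ERR_PTR rather than NULL).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct device and struct iommu_sva. */
struct device { int supports_supervisor; };
struct iommu_sva { struct device *dev; int flags; };

/* Flag name from the proposal; the bit value is an assumption. */
#define IOMMU_SVA_BIND_SUPERVISOR	(1 << 0)
#define IOMMU_SVA_BIND_VALID_FLAGS	IOMMU_SVA_BIND_SUPERVISOR

/* One static handle keeps the sketch free of allocation. */
static struct iommu_sva sva_slot;

/* Bind with explicit flags instead of an overloaded drvdata pointer. */
struct iommu_sva *sva_bind_sketch(struct device *dev, int flags)
{
	/* Reject a missing device and any flag bit we do not know about. */
	if (!dev || (flags & ~IOMMU_SVA_BIND_VALID_FLAGS))
		return NULL;
	/* Supervisor mode is only granted if the device supports it. */
	if ((flags & IOMMU_SVA_BIND_SUPERVISOR) && !dev->supports_supervisor)
		return NULL;
	sva_slot.dev = dev;
	sva_slot.flags = flags;
	return &sva_slot;
}
```

Rejecting unknown flag bits up front leaves room to add more flags later
without changing the prototype again.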

> 
> Let me try to summarize PASID allocation as below:
> 
> Interfaces	| Usage	|  Limit	| bind¹ |User visible
> --------------------------------------------------------------------
> /dev/ioasid²	| G-SVA/IOVA	|  cgroup	| No	|Yes
> --------------------------------------------------------------------
> char dev³	| SVA		|  cgroup	| Yes	|No
> --------------------------------------------------------------------
> iommu driver	| default PASID|  no		| No	|No

Is this PASID #0?

> --------------------------------------------------------------------
> kernel		| super SVA	| no		| yes   |No
> --------------------------------------------------------------------

Also wondering about device drivers allocating auxiliary domains for their
private use, to do iommu_map/unmap on private PASIDs (a clean replacement
for super SVA, for example). Would that go through the same path as
/dev/ioasid and use the cgroup of the current task?

Thanks,
Jean

> 
> ¹ Allocated during SVA bind
> ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
>   ownership is assigned to the process that does the allocation.
> ³ Include uacce, other private device driver char dev such as idxd
> 
> Currently, the proposed /dev/ioasid interface does not map individual PASID
> with an FD. The FD is at the ioasid_set granularity and bond to the current
> mm. We could extend the IOCTLs to cover individual PASID-FD passing case
> when use cases arise. Would this work?
> 
> Thanks,
> 
> Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-19 18:22             ` Jacob Pan
  2021-03-22  9:24               ` Jean-Philippe Brucker
@ 2021-03-22 12:03               ` Jason Gunthorpe
  2021-03-24 19:05                 ` Jacob Pan
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-22 12:03 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > > >   
> > > > > Although there is no use for it at the moment (only two upstream
> > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > like the client-server model where the privileged process does
> > > > > bind() and programs the hardware queue on behalf of the client
> > > > > process.  
> > > > 
> > > > This creates a lot complexity, how do does process A get a secure
> > > > reference to B? How does it access the memory in B to setup the HW?  
> > > 
> > > mm_access() for example, and passing addresses via IPC  
> > 
> > I'd rather the source process establish its own PASID and then pass
> > the rights to use it to some other process via FD passing than try to
> > go the other way. There are lots of security questions with something
> > like mm_access.
> > 
> 
> Thank you all for the input, it sounds like we are OK to remove mm argument
> from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?
> 
> Let me try to summarize PASID allocation as below:
> 
> Interfaces	| Usage	|  Limit	| bind¹ |User visible
> /dev/ioasid²	| G-SVA/IOVA	|  cgroup	| No	|Yes
> char dev³	| SVA		|  cgroup	| Yes	|No
> iommu driver	| default PASID|  no		| No	|No
> kernel		| super SVA	| no		| yes   |No
> 
> ¹ Allocated during SVA bind
> ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
>   ownership is assigned to the process that does the allocation.

What does "not bound to a mm" mean?

IMHO a user-created PASID is either bound to a mm (current) at creation
time, or it will never be bound to a mm and its page table is under
user control via /dev/ioasid.

I thought the whole point of something like a /dev/ioasid was to get
away from each and every device creating its own PASID interface?

It may be somewhat reasonable that some devices could have some easy
'make a SVA PASID on current' interface built in, but anything more
complicated should use /dev/ioasid, and anything consuming PASID
should also have an API to import and attach a PASID from /dev/ioasid.

> Currently, the proposed /dev/ioasid interface does not map individual PASID
> with an FD. The FD is at the ioasid_set granularity and bond to the current
> mm. We could extend the IOCTLs to cover individual PASID-FD passing case
> when use cases arise. Would this work?

Is it a good idea that the FD is per ioasid_set ? What is the set used
for?

Usually kernel interfaces work nicer with a one fd/one object model.

But even if it is a set, you could pass the set between co-operating
processes and the PASID can be created in the correct 'current'. But
there are all kinds of security questions as soon as you start doing
anything like this - is there really a use case?

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-22  9:24               ` Jean-Philippe Brucker
@ 2021-03-24 17:02                 ` Jacob Pan
  2021-03-24 17:03                   ` Jason Gunthorpe
  2021-03-25 10:26                   ` Jean-Philippe Brucker
  0 siblings, 2 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-24 17:02 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang, jacob.jun.pan

Hi Jean-Philippe,

On Mon, 22 Mar 2021 10:24:00 +0100, Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:

> On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker
> > > wrote:  
> > > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:    
> > > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker
> > > > > wrote: 
> > > > > > Although there is no use for it at the moment (only two upstream
> > > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > > like the client-server model where the privileged process does
> > > > > > bind() and programs the hardware queue on behalf of the client
> > > > > > process.    
> > > > > 
> > > > > This creates a lot complexity, how do does process A get a secure
> > > > > reference to B? How does it access the memory in B to setup the
> > > > > HW?    
> > > > 
> > > > mm_access() for example, and passing addresses via IPC    
> > > 
> > > I'd rather the source process establish its own PASID and then pass
> > > the rights to use it to some other process via FD passing than try to
> > > go the other way. There are lots of security questions with something
> > > like mm_access.
> > >   
> > 
> > Thank you all for the input, it sounds like we are OK to remove mm
> > argument from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for
> > now?  
> 
> Fine by me. By the way the IDXD currently missues the bind API for
> supervisor PASID, and the drvdata parameter isn't otherwise used. This
> would be a good occasion to clean both. The new bind prototype could be:
> 
> struct iommu_sva *iommu_sva_bind_device(struct device *dev, int flags)
> 
Yes, we really just hijacked drvdata as flags; it would be cleaner to use
flags explicitly.

> And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
> the SMMU, but I think we need to clean the current usage)
> 
You mean move #define SVM_FLAG_SUPERVISOR_MODE out of Intel code to be a
generic flag in iommu-sva-lib.h called IOMMU_SVA_BIND_SUPERVISOR?

I agree if that is the proposal.

> > 
> > Let me try to summarize PASID allocation as below:
> > 
> > Interfaces	| Usage	|  Limit	| bind¹ |User visible
> > --------------------------------------------------------------------
> > /dev/ioasid²	| G-SVA/IOVA	|  cgroup	| No	|Yes
> > --------------------------------------------------------------------
> > char dev³	| SVA		|  cgroup	| Yes	|No
> > --------------------------------------------------------------------
> > iommu driver	| default PASID|  no		| No	|No
> >  
> 
> Is this PASID #0?
> 
True for the native case, but not limited to PASID#0 for the guest case. E.g.
for mdev assignment with guest IOVA, the guest PASID would be #0, but the host
aux domain default PASID can be non-zero. Here I meant to include both cases.

> > --------------------------------------------------------------------
> > kernel		| super SVA	| no		| yes   |No
> > --------------------------------------------------------------------  
> 
> Also wondering about device driver allocating auxiliary domains for their
> private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> to super SVA, for example). Would that go through the same path as
> /dev/ioasid and use the cgroup of current task?
>
For the in-kernel private use, I don't think we should restrict based on
cgroup, since there is no affinity to user processes. I also think the
PASID allocation should just use kernel API instead of /dev/ioasid. Why
would user space need to know the actual PASID # for device private domains?
Maybe I missed your idea?

> Thanks,
> Jean
> 
> > 
> > ¹ Allocated during SVA bind
> > ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
> >   ownership is assigned to the process that does the allocation.
> > ³ Include uacce, other private device driver char dev such as idxd
> > 
> > Currently, the proposed /dev/ioasid interface does not map individual
> > PASID with an FD. The FD is at the ioasid_set granularity and bond to
> > the current mm. We could extend the IOCTLs to cover individual PASID-FD
> > passing case when use cases arise. Would this work?
> > 
> > Thanks,
> > 
> > Jacob  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-24 17:02                 ` Jacob Pan
@ 2021-03-24 17:03                   ` Jason Gunthorpe
  2021-03-24 22:12                     ` Jacob Pan
  2021-03-25 10:26                   ` Jean-Philippe Brucker
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 17:03 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > Also wondering about device driver allocating auxiliary domains for their
> > private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> > to super SVA, for example). Would that go through the same path as
> > /dev/ioasid and use the cgroup of current task?
>
> For the in-kernel private use, I don't think we should restrict based on
> cgroup, since there is no affinity to user processes. I also think the
> PASID allocation should just use kernel API instead of /dev/ioasid. Why
> would user space need to know the actual PASID # for device private domains?
> Maybe I missed your idea?

There is not much in the kernel that isn't triggered by a process, I
would be careful about the idea that there is a class of users that
can consume a cgroup controlled resource without being inside the
cgroup.

We've got into trouble before overlooking this and with something
greenfield like PASID it would be best built in to the API to prevent
a mistake. eg accepting a cgroup or process input to the allocator.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-22 12:03               ` Jason Gunthorpe
@ 2021-03-24 19:05                 ` Jacob Pan
  2021-03-29 16:31                   ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-24 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Mon, 22 Mar 2021 09:03:00 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker
> > > wrote:  
> > > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:    
> > > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker
> > > > > wrote: 
> > > > > > Although there is no use for it at the moment (only two upstream
> > > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > > like the client-server model where the privileged process does
> > > > > > bind() and programs the hardware queue on behalf of the client
> > > > > > process.    
> > > > > 
> > > > > This creates a lot complexity, how do does process A get a secure
> > > > > reference to B? How does it access the memory in B to setup the
> > > > > HW?    
> > > > 
> > > > mm_access() for example, and passing addresses via IPC    
> > > 
> > > I'd rather the source process establish its own PASID and then pass
> > > the rights to use it to some other process via FD passing than try to
> > > go the other way. There are lots of security questions with something
> > > like mm_access.
> > >   
> > 
> > Thank you all for the input, it sounds like we are OK to remove mm
> > argument from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for
> > now?
> > 
> > Let me try to summarize PASID allocation as below:
> > 
> > Interfaces	| Usage	|  Limit	| bind¹ |User visible
> > /dev/ioasid²	| G-SVA/IOVA	|  cgroup	| No	|Yes
> > char dev³	| SVA		|  cgroup	| Yes	|No
> > iommu driver	| default PASID|  no		| No	|No
> > kernel		| super SVA	| no		| yes   |No
> > 
> > ¹ Allocated during SVA bind
> > ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
> >   ownership is assigned to the process that does the allocation.  
> 
> What does "not bound to a mm" mean?
> 
I meant that the IOASID allocated via /dev/ioasid is in a clean state (just a
number). Its initial state is not bound to an mm, unlike sva_bind_device(),
where the IOASID is allocated at bind time.

The use case is to support guest SVA bind, where allocation and bind are in
two separate steps.
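
As a rough illustration of the two flows, here is a small state sketch.
The state names and helper functions are hypothetical and only model the
ordering described above: allocate then bind for the guest case, versus a
single allocate-and-bind step for the native case.

```c
#include <assert.h>

/* Hypothetical PASID states; the names are assumptions for illustration. */
enum pasid_state { PASID_FREE, PASID_ALLOCATED, PASID_BOUND };

struct pasid_sketch { enum pasid_state state; };

/* Guest flow, step 1: allocation only, the PASID is "just a number". */
int pasid_alloc_only(struct pasid_sketch *p)
{
	if (p->state != PASID_FREE)
		return -1;
	p->state = PASID_ALLOCATED;
	return 0;
}

/* Guest flow, step 2: bind later; only valid on an allocated PASID. */
int pasid_bind_guest(struct pasid_sketch *p)
{
	if (p->state != PASID_ALLOCATED)
		return -1;
	p->state = PASID_BOUND;
	return 0;
}

/* Native flow: allocation happens as part of the bind itself. */
int pasid_native_bind(struct pasid_sketch *p)
{
	if (p->state != PASID_FREE)
		return -1;
	p->state = PASID_BOUND;
	return 0;
}
```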

> IMHO a use created PASID is either bound to a mm (current) at creation
> time, or it will never be bound to a mm and its page table is under
> user control via /dev/ioasid.
> 
True for a PASID used in native SVA bind. But for binding with a guest mm,
the PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), then
bound with the host IOMMU when the vIOMMU PASID cache is invalidated.

Our intention is to have two separate interfaces:
1. /dev/ioasid (allocation/free only)
2. /dev/sva (handles all SVA related activities including page tables)

> I thought the whole point of something like a /dev/ioasid was to get
> away from each and every device creating its own PASID interface?
> 
Yes, but only for the use cases that need to expose PASID to userspace.
AFAICT, the cases are:
1. guest SVA (bind guest mm)
2. full PF/VF assignment (not mediated) where the guest driver wants to
program the actual PASID onto the device.

> It maybe somewhat reasonable that some devices could have some easy
> 'make a SVA PASID on current' interface built in,
I agree, this is the case where the PASID is hidden from userspace, right?
e.g. uacce.

> but anything more
> complicated should use /dev/ioasid, and anything consuming PASID
> should also have an API to import and attach a PASID from /dev/ioasid.
> 
Would the above two use cases meet the "complicated" criteria? Or should we
say anything that needs the explicit PASID value has to go through
/dev/ioasid?

Could you give some high-level hints on the APIs that hook up an IOASID
allocated from /dev/ioasid and use cases that combine device and domain
information? Yi is working on the /dev/sva RFC, so it would be good to have a
direction check.

> > Currently, the proposed /dev/ioasid interface does not map individual
> > PASID with an FD. The FD is at the ioasid_set granularity and bond to
> > the current mm. We could extend the IOCTLs to cover individual PASID-FD
> > passing case when use cases arise. Would this work?  
> 
> Is it a good idea that the FD is per ioasid_set ?
We were thinking the allocation IOCTL is on a per-set basis, so we know
the ownership relation between PASIDs and their set. If a per-PASID FD is
needed, we can extend.

> What is the set used
> for?
> 
I tried to document the concept in
https://lore.kernel.org/lkml/1614463286-97618-2-git-send-email-jacob.jun.pan@linux.intel.com/

In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm.
The use cases are as follows:
1. Identify a pool of PASIDs for permission checking (belong to the same VM),
e.g. only allow SVA binding for PASIDs allocated from the same set.

2. Allow different PASID-aware kernel subsystems to associate, e.g. KVM,
device drivers, and IOMMU driver. i.e. each KVM instance only cares about
the ioasid_set associated with the VM. Event notifications are also within
the ioasid_set to synchronize PASID states.

3. Guest-Host PASID look up (each set has its own XArray to store the
mapping)

4. Quota control (going away once we have cgroup)
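
To illustrate use cases 1 and 3, here is a simplified userspace sketch of a
set that holds guest-to-host ID mappings and gates binding on set
membership. It uses a small fixed array where the kernel code uses an
XArray, and all names (ioasid_set_sketch, set_add, set_find_host,
set_may_bind) are hypothetical, not the API in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SET_MAX_IDS	8
#define INVALID_ID	UINT32_MAX

struct set_entry { uint32_t guest_id; uint32_t host_id; };

/* Hypothetical set: the token stands in for the owning host mm. */
struct ioasid_set_sketch {
	const void *token;
	size_t nr;
	struct set_entry ids[SET_MAX_IDS];
};

/* Record a guest->host mapping; -1 if the set is full or the guest ID exists. */
int set_add(struct ioasid_set_sketch *set, uint32_t guest_id, uint32_t host_id)
{
	size_t i;

	if (set->nr >= SET_MAX_IDS)
		return -1;
	for (i = 0; i < set->nr; i++)
		if (set->ids[i].guest_id == guest_id)
			return -1;
	set->ids[set->nr].guest_id = guest_id;
	set->ids[set->nr].host_id = host_id;
	set->nr++;
	return 0;
}

/* Use case 3: guest-to-host lookup, scoped to one set. */
uint32_t set_find_host(const struct ioasid_set_sketch *set, uint32_t guest_id)
{
	size_t i;

	for (i = 0; i < set->nr; i++)
		if (set->ids[i].guest_id == guest_id)
			return set->ids[i].host_id;
	return INVALID_ID;
}

/* Use case 1: allow a bind only for the set owner and for IDs in the set. */
int set_may_bind(const struct ioasid_set_sketch *set, const void *token,
		 uint32_t host_id)
{
	size_t i;

	if (set->token != token)
		return 0;
	for (i = 0; i < set->nr; i++)
		if (set->ids[i].host_id == host_id)
			return 1;
	return 0;
}
```

Keeping the mapping inside the set means the same guest ID in two VMs
resolves independently, which is the point of the per-set lookup.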

> Usually kernel interfaces work nicer with a one fd/one object model.
> 
> But even if it is a set, you could pass the set between co-operating
> processes and the PASID can be created in the correct 'current'. But
> there is all kinds of security questsions as soon as you start doing
> anything like this - is there really a use case?
> 
We don't see a use case for passing ioasid_set to another process. All the
four use cases above are for the current process.

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-24 17:03                   ` Jason Gunthorpe
@ 2021-03-24 22:12                     ` Jacob Pan
  2021-03-25 10:21                       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-24 22:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > > Also wondering about device driver allocating auxiliary domains for
> > > their private use, to do iommu_map/unmap on private PASIDs (a clean
> > > replacement to super SVA, for example). Would that go through the
> > > same path as /dev/ioasid and use the cgroup of current task?  
> >
> > For the in-kernel private use, I don't think we should restrict based on
> > cgroup, since there is no affinity to user processes. I also think the
> > PASID allocation should just use kernel API instead of /dev/ioasid. Why
> > would user space need to know the actual PASID # for device private
> > domains? Maybe I missed your idea?  
> 
> There is not much in the kernel that isn't triggered by a process, I
> would be careful about the idea that there is a class of users that
> can consume a cgroup controlled resource without being inside the
> cgroup.
> 
> We've got into trouble before overlooking this and with something
> greenfield like PASID it would be best built in to the API to prevent
> a mistake. eg accepting a cgroup or process input to the allocator.
> 
Makes sense. But I think we only allow charging the current cgroup. How about
I add the following to ioasid_alloc():

	misc_cg = get_current_misc_cg();
	ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
	if (ret) {
		put_misc_cg(misc_cg);
		return ret;
	}

BTW, IOASID will be one of the resources under the proposed misc cgroup.
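
For illustration only, a self-contained userspace model of the
charge-then-allocate pattern above. It does not use the real misc cgroup
API; cg_sketch, cg_try_charge(), cg_uncharge() and the tiny ID pool are
all hypothetical stand-ins. The point it shows is that a failed allocation
has to give the charge back, and that free has to uncharge.

```c
#include <assert.h>

/* Hypothetical cgroup accounting state: current usage and the limit. */
struct cg_sketch { long usage; long max; };

/* Charge 'amount' units; returns 0 on success, -1 if over the limit. */
int cg_try_charge(struct cg_sketch *cg, long amount)
{
	if (cg->usage + amount > cg->max)
		return -1;
	cg->usage += amount;
	return 0;
}

/* Return 'amount' previously charged units. */
void cg_uncharge(struct cg_sketch *cg, long amount)
{
	assert(cg->usage >= amount);
	cg->usage -= amount;
}

#define NR_IDS 4
static int id_used[NR_IDS];	/* tiny stand-in for the ID allocator */

/* Charge first, then allocate; undo the charge if no ID is available. */
int alloc_id_charged(struct cg_sketch *cg)
{
	int id;

	if (cg_try_charge(cg, 1))
		return -1;
	for (id = 0; id < NR_IDS; id++) {
		if (!id_used[id]) {
			id_used[id] = 1;
			return id;
		}
	}
	cg_uncharge(cg, 1);
	return -1;
}

/* Free the ID and return its charge to the cgroup. */
void free_id_charged(struct cg_sketch *cg, int id)
{
	id_used[id] = 0;
	cg_uncharge(cg, 1);
}
```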

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-24 22:12                     ` Jacob Pan
@ 2021-03-25 10:21                       ` Jean-Philippe Brucker
  2021-03-25 17:02                         ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-25 10:21 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > > > Also wondering about device driver allocating auxiliary domains for
> > > > their private use, to do iommu_map/unmap on private PASIDs (a clean
> > > > replacement to super SVA, for example). Would that go through the
> > > > same path as /dev/ioasid and use the cgroup of current task?  
> > >
> > > For the in-kernel private use, I don't think we should restrict based on
> > > cgroup, since there is no affinity to user processes. I also think the
> > > PASID allocation should just use kernel API instead of /dev/ioasid. Why
> > > would user space need to know the actual PASID # for device private
> > > domains? Maybe I missed your idea?  
> > 
> > There is not much in the kernel that isn't triggered by a process, I
> > would be careful about the idea that there is a class of users that
> > can consume a cgroup controlled resource without being inside the
> > cgroup.
> > 
> > We've got into trouble before overlooking this and with something
> > greenfield like PASID it would be best built in to the API to prevent
> > a mistake. eg accepting a cgroup or process input to the allocator.
> > 
> Make sense. But I think we only allow charging the current cgroup, how about
> I add the following to ioasid_alloc():
> 
> 	misc_cg = get_current_misc_cg();
> 	ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> 	if (ret) {
> 		put_misc_cg(misc_cg);
> 		return ret;
> 	}

Does that allow PASID allocation during driver probe, in kernel_init or
modprobe context?

Thanks,
Jean


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-24 17:02                 ` Jacob Pan
  2021-03-24 17:03                   ` Jason Gunthorpe
@ 2021-03-25 10:26                   ` Jean-Philippe Brucker
  1 sibling, 0 replies; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-25 10:26 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
> > the SMMU, but I think we need to clean the current usage)
> > 
> You mean move #define SVM_FLAG_SUPERVISOR_MODE out of Intel code to be a
> generic flag in iommu-sva-lib.h called IOMMU_SVA_BIND_SUPERVISOR?

Yes, though it would need to be in iommu.h since it's used by device
drivers

> > Also wondering about device driver allocating auxiliary domains for their
> > private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> > to super SVA, for example). Would that go through the same path as
> > /dev/ioasid and use the cgroup of current task?
> >
> For the in-kernel private use, I don't think we should restrict based on
> cgroup, since there is no affinity to user processes. I also think the
> PASID allocation should just use kernel API instead of /dev/ioasid. Why
> would user space need to know the actual PASID # for device private domains?
> Maybe I missed your idea?

No that's my bad, I didn't get the role of /dev/ioasid. Let me give the
series a proper read.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-25 10:21                       ` Jean-Philippe Brucker
@ 2021-03-25 17:02                         ` Jacob Pan
  2021-03-25 17:16                           ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-25 17:02 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jason Gunthorpe, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang, jacob.jun.pan

Hi Jean-Philippe,

On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:

> On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > Also wondering about device driver allocating auxiliary domains
> > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > clean replacement to super SVA, for example). Would that go
> > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > current task?    
> > > >
> > > > For the in-kernel private use, I don't think we should restrict
> > > > based on cgroup, since there is no affinity to user processes. I
> > > > also think the PASID allocation should just use kernel API instead
> > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > # for device private domains? Maybe I missed your idea?    
> > > 
> > > There is not much in the kernel that isn't triggered by a process, I
> > > would be careful about the idea that there is a class of users that
> > > can consume a cgroup controlled resource without being inside the
> > > cgroup.
> > > 
> > > We've got into trouble before overlooking this and with something
> > > greenfield like PASID it would be best built in to the API to prevent
> > > a mistake. eg accepting a cgroup or process input to the allocator.
> > >   
> > Make sense. But I think we only allow charging the current cgroup, how
> > about I add the following to ioasid_alloc():
> > 
> > 	misc_cg = get_current_misc_cg();
> > 	ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > 	if (ret) {
> > 		put_misc_cg(misc_cg);
> > 		return ret;
> > 	}  
> 
> Does that allow PASID allocation during driver probe, in kernel_init or
> modprobe context?
> 
Good point. Yes, you can get cgroup subsystem state in kernel_init for
charging/uncharging. I would think module_init should work also since it is
after kernel_init. I have tried the following:
static int __ref kernel_init(void *unused)
 {
        int ret;
+       struct cgroup_subsys_state *css;
+       css = task_get_css(current, pids_cgrp_id);

But that would imply:
1. IOASID has to be built-in, not a module
2. IOASIDs charged on PID1/init would not be subject to the cgroup limit,
since they will be in the root cgroup and we neither support migration nor
will migrate them.

Then it comes back to the question of why we try to limit in-kernel
users per cgroup if we can't enforce the limit in these cases.

> Thanks,
> Jean
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-25 17:02                         ` Jacob Pan
@ 2021-03-25 17:16                           ` Jason Gunthorpe
  2021-03-25 18:23                             ` Jacob Pan
  2021-03-26  8:06                             ` Jean-Philippe Brucker
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 17:16 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> Hi Jean-Philippe,
> 
> On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
> <jean-philippe@linaro.org> wrote:
> 
> > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > > Hi Jason,
> > > 
> > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > > wrote: 
> > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > > Also wondering about device driver allocating auxiliary domains
> > > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > > clean replacement to super SVA, for example). Would that go
> > > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > > current task?    
> > > > >
> > > > > For the in-kernel private use, I don't think we should restrict
> > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > also think the PASID allocation should just use kernel API instead
> > > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > > # for device private domains? Maybe I missed your idea?    
> > > > 
> > > > There is not much in the kernel that isn't triggered by a process, I
> > > > would be careful about the idea that there is a class of users that
> > > > can consume a cgroup controlled resource without being inside the
> > > > cgroup.
> > > > 
> > > > We've got into trouble before overlooking this and with something
> > > > greenfield like PASID it would be best built in to the API to prevent
> > > > a mistake. eg accepting a cgroup or process input to the allocator.
> > > >   
> > > Make sense. But I think we only allow charging the current cgroup, how
> > > about I add the following to ioasid_alloc():
> > > 
> > > 	misc_cg = get_current_misc_cg();
> > > 	ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > > 	if (ret) {
> > > 		put_misc_cg(misc_cg);
> > > 		return ret;
> > > 	}  
> > 
> > Does that allow PASID allocation during driver probe, in kernel_init or
> > modprobe context?
> > 
> Good point. Yes, you can get cgroup subsystem state in kernel_init for
> charging/uncharging. I would think module_init should work also since it is
> after kernel_init. I have tried the following:
> static int __ref kernel_init(void *unused)
>  {
>         int ret;
> +       struct cgroup_subsys_state *css;
> +       css = task_get_css(current, pids_cgrp_id);
> 
> But that would imply:
> 1. IOASID has to be built-in, not as module
> 2. IOASIDs charged on PID1/init would not subject to cgroup limit since it
> will be in the root cgroup and we don't support migration nor will migrate.
> 
> Then it comes back to the question of why do we try to limit in-kernel
> users per cgroup if we can't enforce these cases.

Are these real use cases? Why would a driver binding to a device
create a single kernel pasid at bind time? Why wouldn't it use
untagged DMA?

When someone needs it they can rework it and explain why they are
doing something sane.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-25 17:16                           ` Jason Gunthorpe
@ 2021-03-25 18:23                             ` Jacob Pan
  2021-03-26  8:06                             ` Jean-Philippe Brucker
  1 sibling, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-25 18:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Thu, 25 Mar 2021 14:16:45 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> > Hi Jean-Philippe,
> > 
> > On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
> > <jean-philippe@linaro.org> wrote:
> >   
> > > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:  
> > > > Hi Jason,
> > > > 
> > > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > > > wrote:   
> > > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:    
> > > > > > > Also wondering about device driver allocating auxiliary
> > > > > > > domains for their private use, to do iommu_map/unmap on
> > > > > > > private PASIDs (a clean replacement to super SVA, for
> > > > > > > example). Would that go through the same path as /dev/ioasid
> > > > > > > and use the cgroup of current task?      
> > > > > >
> > > > > > For the in-kernel private use, I don't think we should restrict
> > > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > > also think the PASID allocation should just use kernel API
> > > > > > instead of /dev/ioasid. Why would user space need to know the
> > > > > > actual PASID # for device private domains? Maybe I missed your
> > > > > > idea?      
> > > > > 
> > > > > There is not much in the kernel that isn't triggered by a
> > > > > process, I would be careful about the idea that there is a class
> > > > > of users that can consume a cgroup controlled resource without
> > > > > being inside the cgroup.
> > > > > 
> > > > > We've got into trouble before overlooking this and with something
> > > > > greenfield like PASID it would be best built in to the API to
> > > > > prevent a mistake. eg accepting a cgroup or process input to the
> > > > > allocator. 
> > > > Make sense. But I think we only allow charging the current cgroup,
> > > > how about I add the following to ioasid_alloc():
> > > > 
> > > > 	misc_cg = get_current_misc_cg();
> > > > 	ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > > > 	if (ret) {
> > > > 		put_misc_cg(misc_cg);
> > > > 		return ret;
> > > > 	}    
> > > 
> > > Does that allow PASID allocation during driver probe, in kernel_init
> > > or modprobe context?
> > >   
> > Good point. Yes, you can get cgroup subsystem state in kernel_init for
> > charging/uncharging. I would think module_init should work also since
> > it is after kernel_init. I have tried the following:
> > static int __ref kernel_init(void *unused)
> >  {
> >         int ret;
> > +       struct cgroup_subsys_state *css;
> > +       css = task_get_css(current, pids_cgrp_id);
> > 
> > But that would imply:
> > 1. IOASID has to be built-in, not as module
> > 2. IOASIDs charged on PID1/init would not subject to cgroup limit since
> > it will be in the root cgroup and we don't support migration nor will
> > migrate.
> > 
> > Then it comes back to the question of why do we try to limit in-kernel
> > users per cgroup if we can't enforce these cases.  
> 
> Are these real use cases? Why would a driver binding to a device
> create a single kernel pasid at bind time? Why wouldn't it use
> untagged DMA?
> 
For VT-d, I don't see such use cases. All PASID allocations by the kernel
drivers have proper process context.

> When someone needs it they can rework it and explain why they are
> doing something sane.
> 
Agreed.

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-25 17:16                           ` Jason Gunthorpe
  2021-03-25 18:23                             ` Jacob Pan
@ 2021-03-26  8:06                             ` Jean-Philippe Brucker
  2021-03-30 13:07                               ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-26  8:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Thu, Mar 25, 2021 at 02:16:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> > Hi Jean-Philippe,
> > 
> > On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
> > <jean-philippe@linaro.org> wrote:
> > 
> > > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > > > Hi Jason,
> > > > 
> > > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > > > wrote: 
> > > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > > > Also wondering about device driver allocating auxiliary domains
> > > > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > > > clean replacement to super SVA, for example). Would that go
> > > > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > > > current task?    
> > > > > >
> > > > > > For the in-kernel private use, I don't think we should restrict
> > > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > > also think the PASID allocation should just use kernel API instead
> > > > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > > > # for device private domains? Maybe I missed your idea?    
> > > > > 
> > > > > There is not much in the kernel that isn't triggered by a process, I
> > > > > would be careful about the idea that there is a class of users that
> > > > > can consume a cgroup controlled resource without being inside the
> > > > > cgroup.
> > > > > 
> > > > > We've got into trouble before overlooking this and with something
> > > > > greenfield like PASID it would be best built in to the API to prevent
> > > > > a mistake. eg accepting a cgroup or process input to the allocator.
> > > > >   
> > > > Make sense. But I think we only allow charging the current cgroup, how
> > > > about I add the following to ioasid_alloc():
> > > > 
> > > > 	misc_cg = get_current_misc_cg();
> > > > 	ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > > > 	if (ret) {
> > > > 		put_misc_cg(misc_cg);
> > > > 		return ret;
> > > > 	}  
> > > 
> > > Does that allow PASID allocation during driver probe, in kernel_init or
> > > modprobe context?
> > > 
> > Good point. Yes, you can get cgroup subsystem state in kernel_init for
> > charging/uncharging. I would think module_init should work also since it is
> > after kernel_init. I have tried the following:
> > static int __ref kernel_init(void *unused)
> >  {
> >         int ret;
> > +       struct cgroup_subsys_state *css;
> > +       css = task_get_css(current, pids_cgrp_id);
> > 
> > But that would imply:
> > 1. IOASID has to be built-in, not as module

If IOASID is a module, the device driver will probe once the IOMMU module
is available, which I think always happens in the probe deferral kworker.

> > 2. IOASIDs charged on PID1/init would not subject to cgroup limit since it
> > will be in the root cgroup and we don't support migration nor will migrate.
> > 
> > Then it comes back to the question of why do we try to limit in-kernel
> > users per cgroup if we can't enforce these cases.

It may be better to explicitly pass a cgroup during allocation as Jason
suggested. That way anyone using the API will have to be aware of this and
pass the root cgroup if that's what they want.
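
A minimal sketch of what "explicitly pass a cgroup" could look like, written
as a user-space C model rather than kernel code. All names here (demo_cgroup,
demo_try_charge, demo_ioasid_alloc) are hypothetical and only illustrate the
shape of the API; the -EBUSY return on hitting the limit is an assumption of
this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup with an IOASID limit; not a kernel struct. */
struct demo_cgroup {
	const char *name;
	unsigned int limit;	/* maximum IOASIDs this group may hold */
	unsigned int usage;	/* IOASIDs currently charged */
};

/* Charge one IOASID to @cg; return 0 on success or -EBUSY at the limit. */
static int demo_try_charge(struct demo_cgroup *cg)
{
	if (cg->usage >= cg->limit)
		return -EBUSY;
	cg->usage++;
	return 0;
}

/* Give back one previously charged IOASID. */
static void demo_uncharge(struct demo_cgroup *cg)
{
	assert(cg->usage > 0);
	cg->usage--;
}

static unsigned int demo_next_id = 1;

/*
 * Allocate an ID, charging the cgroup the caller names explicitly.
 * A NULL cgroup is rejected, so charging the root has to be a visible
 * choice made by the caller instead of a silent default.
 */
static int demo_ioasid_alloc(struct demo_cgroup *cg)
{
	int ret;

	if (!cg)
		return -EINVAL;
	ret = demo_try_charge(cg);
	if (ret)
		return ret;
	return (int)demo_next_id++;
}
```

With this shape, a boot-time or probe-time caller would have to write
something like demo_ioasid_alloc(&root) on purpose, which is the awareness
the explicit argument is meant to force.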

> Are these real use cases? Why would a driver binding to a device
> create a single kernel pasid at bind time? Why wouldn't it use
> untagged DMA?

It's not inconceivable to have a control queue doing DMA tagged with
PASID. The devices I know either use untagged DMA, or have a choice to use
a PASID. We're not outright forbidding PASID allocation at boot (I don't
think we can or should) and we won't be able to check every use of the
API, so I'm trying to figure out whether it will always default to root
cgroup, or crash in some corner case.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-24 19:05                 ` Jacob Pan
@ 2021-03-29 16:31                   ` Jason Gunthorpe
  2021-03-29 22:55                     ` Jacob Pan
                                       ` (3 more replies)
  0 siblings, 4 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-29 16:31 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:

> > IMHO a use created PASID is either bound to a mm (current) at creation
> > time, or it will never be bound to a mm and its page table is under
> > user control via /dev/ioasid.
> > 
> True for PASID used in native SVA bind. But for binding with a guest mm,
> PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> 
> Our intention is to have two separate interfaces:
> 1. /dev/ioasid (allocation/free only)
> 2. /dev/sva (handles all SVA related activities including page tables)

I'm not sure I understand why you'd want to have two things. Doesn't
that just complicate everything?

Manipulating the ioasid, including filling it with page tables, seems
an integral, inseparable part of the whole interface. Why have two?

> > I thought the whole point of something like a /dev/ioasid was to get
> > away from each and every device creating its own PASID interface?
> > 
> yes, but only for the use cases that need to expose PASID to the
> userspace.

Why "but only"? This thing should reach for a higher generality, not
just be contained to solve some problem within qemu.

> > It maybe somewhat reasonable that some devices could have some easy
> > 'make a SVA PASID on current' interface built in,
> I agree, this is the case PASID is hidden from the userspace, right? e.g.
> uacce.

"hidden", I guess, but does it matter so much?

The PASID would still consume a cgroup credit.

> > but anything more
> > complicated should use /dev/ioasid, and anything consuming PASID
> > should also have an API to import and attach a PASID from /dev/ioasid.
> > 
> Would the above two use cases constitute the "complicated" criteria? Or we
> should say anything that need the explicit PASID value has to through
> /dev/ioasid?

Anything that needs more than creating a hidden PASID linked to
current should use the full interface.

> In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> the use case is as the following:

From that doc:

  It is imperative to enforce
  VM-IOASID ownership such that a malicious guest cannot target DMA
  traffic outside its own IOASIDs, or free an active IOASID that belongs
  to another VM.

Huh?

Security in a PASID world comes from the IOMMU blocking access to the
PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
then that guest can cause the device to issue any PASID by having
complete control and the vIOMMU is supposed to tell the real IOMMU
what PASID's the device is allowed to access.

If a device is sharing a single PCI function with different security
contexts (eg vfio mdev) then the device itself is responsible to
ensure that only the secure interface can program a PASID and a less
secure context can never self-enroll. 

Here the mdev driver would have to consult with the vIOMMU to ensure
the mdev device is allowed to access the PASID - is that what this
set stuff is about? 

If yes, it is backwards. The MDEV is the thing doing the security, the
MDEV should have the list of allowed PASID's and a single PASID
created under /dev/ioasid should be loaded into MDEV with some 'Ok you
can use PASID xyz from FD abc' command.

Because you absolutely don't want to have a generic 'set' that all the
mdevs are sharing as that violates the basic security principle at the
start - each and every device must have a unique list of what PASID's
it can talk to.
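
A minimal sketch of the per-device list described above, as a user-space C
model. The struct and function names (demo_mdev, demo_mdev_allow_pasid,
demo_mdev_may_use) are hypothetical and are not part of any kernel or VFIO
API; the fixed-size array is only there to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PASIDS 8

/* Hypothetical per-device state: every device keeps its own allow list. */
struct demo_mdev {
	uint32_t allowed[DEMO_MAX_PASIDS];
	size_t nr_allowed;
};

/* Models the "you can use PASID xyz" command; false if the list is full. */
static bool demo_mdev_allow_pasid(struct demo_mdev *mdev, uint32_t pasid)
{
	if (mdev->nr_allowed == DEMO_MAX_PASIDS)
		return false;
	mdev->allowed[mdev->nr_allowed++] = pasid;
	return true;
}

/* A PASID may be programmed only if this particular device was granted it. */
static bool demo_mdev_may_use(const struct demo_mdev *mdev, uint32_t pasid)
{
	for (size_t i = 0; i < mdev->nr_allowed; i++)
		if (mdev->allowed[i] == pasid)
			return true;
	return false;
}
```

The point of the model is that a grant made to one device says nothing about
any other device: the check never consults a list shared between mdevs.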

> 1. Identify a pool of PASIDs for permission checking (below to the same VM),
> e.g. only allow SVA binding for PASIDs allocated from the same set.
> 
> 2. Allow different PASID-aware kernel subsystems to associate, e.g. KVM,
> device drivers, and IOMMU driver. i.e. each KVM instance only cares about
> the ioasid_set associated with the VM. Events notifications are also within
> the ioasid_set to synchronize PASID states.
> 
> 3. Guest-Host PASID look up (each set has its own XArray to store the
> mapping)
> 
> 4. Quota control (going away once we have cgroup)

It sounds worrisome that things have gone this way.

I'd say you should have a single /dev/ioasid per VM and KVM should
attach to that - it should get all the global events/etc that are not
device specific.

permission checking *must* be done on a per-device level, either inside the
mdev driver, or inside the IOMMU at a per-PCI device level.

Not sure what guest-host PASID means, these have to be 1:1 for device
assignment to work - why would you use something else for mdev?

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-29 16:31                   ` Jason Gunthorpe
@ 2021-03-29 22:55                     ` Jacob Pan
  2021-03-30 13:43                       ` Jason Gunthorpe
  2021-03-30  1:37                     ` Tian, Kevin
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-29 22:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Mon, 29 Mar 2021 13:31:47 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> 
> > > IMHO a use created PASID is either bound to a mm (current) at creation
> > > time, or it will never be bound to a mm and its page table is under
> > > user control via /dev/ioasid.
> > >   
> > True for PASID used in native SVA bind. But for binding with a guest mm,
> > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> > bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> > 
> > Our intention is to have two separate interfaces:
> > 1. /dev/ioasid (allocation/free only)
> > 2. /dev/sva (handles all SVA related activities including page tables)  
> 
> I'm not sure I understand why you'd want to have two things. Doesn't
> that just complicate everything?
> 
> Manipulating the ioasid, including filling it with page tables, seems
> an integral inseperable part of the whole interface. Why have two ?
> 
In one of the earlier discussions, I was made aware of some use cases (by
AMD, iirc) where PASID can be used w/o IOMMU. That is why I tried to keep
ioasid a separate subsystem. Other than that, I don't see an issue
combining the two.

> > > I thought the whole point of something like a /dev/ioasid was to get
> > > away from each and every device creating its own PASID interface?
> > >   
> > yes, but only for the use cases that need to expose PASID to the
> > userspace.  
> 
> Why "but only"? This thing should reach for a higher generality, not
> just be contained to solve some problem within qemu.
> 
I totally agree in terms of generality. I was just trying to point out
that existing frameworks and drivers, such as uacce and the idxd driver, do
not need to use /dev/ioasid.

> > > It maybe somewhat reasonable that some devices could have some easy
> > > 'make a SVA PASID on current' interface built in,  
> > I agree, this is the case PASID is hidden from the userspace, right?
> > e.g. uacce.  
> 
> "hidden", I guess, but does it matter so much?
> 
It matters when it comes to which interface to choose. Use /dev/ioasid to
allocate if the PASID value cannot be hidden. Use some other interface that
binds to current and allocates if the PASID is not visible to the user.

> The PASID would still consume a cgroup credit
> 
Yes, the credit is still consumed. Just the PASID value is hidden.

> > > but anything more
> > > complicated should use /dev/ioasid, and anything consuming PASID
> > > should also have an API to import and attach a PASID from /dev/ioasid.
> > >   
> > Would the above two use cases constitute the "complicated" criteria? Or
> > we should say anything that need the explicit PASID value has to through
> > /dev/ioasid?  
> 
> Anything that needs more that creating a hidden PASID link'd to
> current should use the full interface.
> 
Yes, I think we are on the same page. For example, today's uacce or idxd
driver creates a hidden PASID when user does open(), where a new WQ is
provisioned and bound to current mm. This is the case where /dev/ioasid is
not needed.

> > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host
> > mm, the use case is as the following:  
> 
> From that doc:
> 
>   It is imperative to enforce
>   VM-IOASID ownership such that a malicious guest cannot target DMA
>   traffic outside its own IOASIDs, or free an active IOASID that belongs
>   to another VM.
> 
> Huh?
> 
Sorry, I am not following. In the doc, I have an example to show the
ioasid_set to VM/mm mapping. We use mm as the ioasid_set token to identify
who the owner of an IOASID is. i.e. who allocated the IOASID. Non-owner
cannot perform bind page table or free operations.

Section: IOASID Set Private ID (SPID)
 .------------------.    .------------------.
 |   VM 1           |    |   VM 2           |
 |                  |    |                  |
 |------------------|    |------------------|
 | GPASID/SPID 101  |    | GPASID/SPID 101  |
 '------------------'    '------------------'     Guest
 __________|______________________|____________________
           |                      |               Host
           v                      v
 .------------------.    .------------------.
 | Host IOASID 201  |    | Host IOASID 202  |
 '------------------'    '------------------'
 |   IOASID set 1   |    |   IOASID set 2   |
 '------------------'    '------------------'


> Security in a PASID world comes from the IOMMU blocking access to the
> PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> then that guest can cause the device to issue any PASID by having
> complete control and the vIOMMU is supposed to tell the real IOMMU
> what PASID's the device is alowed to access.
> 
Yes, each PF/VF has its own PASID table. The device can do whatever
it wants as long as the PASID is present in the table. Programming of the
pIOMMU PASID table entry, however, is controlled by the host.

IMHO, there are two levels of security here:
1. A PASID can only be used by a secure context
2. A device can only use allowed PASIDs (PASID namespace is system-wide but
PASID table storage is per PF/VF)

IOASID set is designed for #1.

> If a device is sharing a single PCI function with different security
> contexts (eg vfio mdev) then the device itself is responsible to
> ensure that only the secure interface can program a PASID and a less
> secure context can never self-enroll. 
> 
If two mdevs from the same PF dev are assigned to two VMs, the PASID
table will be shared. IOASID set ensures one VM cannot program another VM's
PASIDs. I assume 'secure context' is per VM when it comes to host PASID.

> Here the mdev driver would have to consule with the vIOMMU to ensure
> the mdev device is allowed to access the PASID - is that what this
> set stuff is about? 
> 
No. The mdev driver consults the IOASID core when the guest programs a
guest PASID onto the mdev. The VDCM driver does a lookup:
host_pasid = ioasid_find_by_spid(ioasid_set, guest_pasid);

If the guest_pasid does not exist in the ioasid_set, the mdev programming
fails; if the guest_pasid does exist but it maps to a wrong host PASID, the
damage is limited to the guest itself.
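
A minimal sketch of that lookup as a user-space C model, using the numbers
from the SPID diagram above. The names demo_ioasid_set and demo_find_by_spid
are hypothetical stand-ins and do not reproduce the signature of
ioasid_find_by_spid(); a small array replaces the per-set XArray to keep the
example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_PASID UINT32_MAX
#define DEMO_SET_CAPACITY 4

/* One guest-to-host mapping entry. */
struct demo_spid_entry {
	uint32_t guest_pasid;
	uint32_t host_pasid;
};

/* Hypothetical stand-in for an ioasid_set owned by one VM. */
struct demo_ioasid_set {
	struct demo_spid_entry entries[DEMO_SET_CAPACITY];
	size_t nr_entries;
};

/*
 * Return the host PASID mapped to @guest_pasid within @set, or
 * DEMO_INVALID_PASID if this set has no such guest PASID.
 */
static uint32_t demo_find_by_spid(const struct demo_ioasid_set *set,
				  uint32_t guest_pasid)
{
	for (size_t i = 0; i < set->nr_entries; i++)
		if (set->entries[i].guest_pasid == guest_pasid)
			return set->entries[i].host_pasid;
	return DEMO_INVALID_PASID;
}
```

Because the lookup is scoped to one set, the same guest PASID 101 resolves
to a different host PASID for each VM, and a guest PASID that was never
allocated in the set yields no host PASID at all.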

> If yes, it is backwards. The MDEV is the thing doing the security, the
> MDEV should have the list of allowed PASID's and a single PASID
> created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> can use PASID xyz from FD abc' command.
> 
I guess that is not the case. For a VT-d dedicated WQ, only one PASID can
be programmed onto the device. Programming the PASID with a /dev/sva FD abc
command will be checked against the mm that was used to do the allocation
through /dev/ioasid.

For a single shared WQ assigned to multiple VMs, there will be one mdev per
VM. Again, FD commands are limited to the PASIDs allocated for the VM.

For a single shared WQ assigned to one VM, it can be bound to multiple guest
processes/PASIDs. The host IOMMU driver maintains a list of the PASIDs and
ensures that they are only programmed onto the per-device PASID table.

> Because you absolutely don't want to have a generic 'set' that all the
> mdevs are sharing as that violates the basic security principle at the
> start - each and every device must have a unique list of what PASID's
> it can talk to.
> 
I agree, I don't think this is the case. The ioasid_set is somewhat
orthogonal to mdev collections.

> > 1. Identify a pool of PASIDs for permission checking (below to the same
> > VM), e.g. only allow SVA binding for PASIDs allocated from the same set.
> > 
> > 2. Allow different PASID-aware kernel subsystems to associate, e.g. KVM,
> > device drivers, and IOMMU driver. i.e. each KVM instance only cares
> > about the ioasid_set associated with the VM. Events notifications are
> > also within the ioasid_set to synchronize PASID states.
> > 
> > 3. Guest-Host PASID look up (each set has its own XArray to store the
> > mapping)
> > 
> > 4. Quota control (going away once we have cgroup)  
> 
> It sounds worrysome things have gone this way.
> 
Could you expand on that? Guaranteeing quota is very difficult. cgroup
limit model fits most scalar resources.

> I'd say you shoul have a single /dev/ioasid per VM and KVM should
> attach to that - it should get all the global events/etc that are not
> device specific.
> 
You mean a single /dev/ioasid FD per VM and KVM? I think that is what we
are doing in this set. A VM process can only open /dev/ioasid once, then
use the FD for allocation and pass the PASID for bind page table etc.

> permission checking *must* be done on a per-device level, either inside
> the mdev driver, or inside the IOMMU at a per-PCI device level.
> 
I think we are on the same page. For mdev, VDCM driver makes sure the guest
PASID programmed is allocated by the same VM that also performed the bind SVA.

For PF/VF which is not mediated, the permission is implied by the IOMMU
driver/HW since PASID table is per device.

> Not sure what guest-host PASID means, these have to be 1:1 for device
> assignment to work - why would use something else for mdev?
> 
We have G-H PASID translation. They don't have to be 1:1.
IOASID Set Private ID (SPID) is intended as a generic solution for guest PASID.
Could you review the section "IOASID Set Private ID (SPID)" in the
doc patch?

We also had some slides from last year. Slides 3-6 mostly.
https://static.sched.com/hosted_files/kvmforum2020/9f/KVM_forum_2020_PASID_MGMT_Yi_Jacob_final.pdf

Really appreciate your time!


Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-29 16:31                   ` Jason Gunthorpe
  2021-03-29 22:55                     ` Jacob Pan
@ 2021-03-30  1:37                     ` Tian, Kevin
  2021-03-30 13:28                       ` Jason Gunthorpe
  2021-03-30  2:24                     ` Tian, Kevin
  2021-03-30  4:14                     ` Tian, Kevin
  3 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-03-30  1:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 30, 2021 12:32 AM
> 
> On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> 
> > > IMHO a use created PASID is either bound to a mm (current) at creation
> > > time, or it will never be bound to a mm and its page table is under
> > > user control via /dev/ioasid.
> > >
> > True for PASID used in native SVA bind. But for binding with a guest mm,
> > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> > bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> >
> > Our intention is to have two separate interfaces:
> > 1. /dev/ioasid (allocation/free only)
> > 2. /dev/sva (handles all SVA related activities including page tables)
> 
> I'm not sure I understand why you'd want to have two things. Doesn't
> that just complicate everything?
> 
> Manipulating the ioasid, including filling it with page tables, seems
> an integral inseperable part of the whole interface. Why have two ?

Hi, Jason,

Actually, the above is a major open question as we refactor the vSVA uAPI
in this direction. We have two concerns about merging /dev/ioasid with
/dev/sva, and would like to hear your thoughts on whether they are valid.

First, userspace may use an ioasid in a non-SVA scenario where the ioasid
is bound to a specific security context (e.g. a control vq in vDPA) instead
of being tied to an mm. In this case there is no pgtable binding initiated
from user space. Instead, the ioasid is allocated from /dev/ioasid and then
programmed into the intended security context through the specific
passthrough framework which manages that context.

Second, ioasid is managed per process/VM while pgtable binding is a
device-wise operation.  The userspace flow looks like below for an integral
/dev/ioasid interface:

-----------initialization----------
- ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
- ioasid_fd = open(/dev/ioasid)
- ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
- ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to device
- ioctl(sva_fd, USVA_GET_INFO, &sva_info);
-----------runtime----------------
- ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, &ioasid);
- ioctl(sva_fd, USVA_BIND_PGTBL, &bind_data);
- ioctl(sva_fd, USVA_FLUSH_CACHE, &inv_info);
- ioctl(sva_fd, USVA_UNBIND_PGTBL, &unbind_data);
-----------destroy----------------
- ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
- close(sva_fd)
- close(ioasid_fd)

Our hesitation here is based on one of your earlier comments that
you are not a fan of constructing fd's through ioctl. Are you OK with
the above flow, or do you have a better idea for handling it?

With separate interfaces, userspace just opens /dev/sva instead
of getting it through ioasid_fd:

- ioasid_fd = open(/dev/ioasid)
- sva_fd = open(/dev/sva)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-29 16:31                   ` Jason Gunthorpe
  2021-03-29 22:55                     ` Jacob Pan
  2021-03-30  1:37                     ` Tian, Kevin
@ 2021-03-30  2:24                     ` Tian, Kevin
  2021-03-30 13:24                       ` Jason Gunthorpe
  2021-03-30  4:14                     ` Tian, Kevin
  3 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-03-30  2:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 30, 2021 12:32 AM
> > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> > the use case is as the following:
> 
> From that doc:
> 
>   It is imperative to enforce
>   VM-IOASID ownership such that a malicious guest cannot target DMA
>   traffic outside its own IOASIDs, or free an active IOASID that belongs
>   to another VM.
> 
> Huh?
> 
> Security in a PASID world comes from the IOMMU blocking access to the
> PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> then that guest can cause the device to issue any PASID by having
> complete control and the vIOMMU is supposed to tell the real IOMMU
> what PASID's the device is alowed to access.
> 
> If a device is sharing a single PCI function with different security
> contexts (eg vfio mdev) then the device itself is responsible to
> ensure that only the secure interface can program a PASID and a less
> secure context can never self-enroll.
> 
> Here the mdev driver would have to consule with the vIOMMU to ensure
> the mdev device is allowed to access the PASID - is that what this
> set stuff is about?
> 
> If yes, it is backwards. The MDEV is the thing doing the security, the
> MDEV should have the list of allowed PASID's and a single PASID
> created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> can use PASID xyz from FD abc' command.
> 

The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
in the set of that VM are considered legitimate on this mdev. The mdev
driver will mediate guest operations which program a PASID to the backend
context and load the PASID only if it is within the 'set' (i.e. already
allocated through /dev/ioasid). This prevents a malicious VM from attacking
others. Though it is not the mdev that directly maintains the list of
allowed PASID's, the effect is the same in concept.

Thanks
Kevin


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-29 16:31                   ` Jason Gunthorpe
                                       ` (2 preceding siblings ...)
  2021-03-30  2:24                     ` Tian, Kevin
@ 2021-03-30  4:14                     ` Tian, Kevin
  2021-03-30 13:27                       ` Jason Gunthorpe
  3 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-03-30  4:14 UTC (permalink / raw)
  To: Jason Gunthorpe, Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave

> From: Tian, Kevin
> Sent: Tuesday, March 30, 2021 10:24 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> > > the use case is as the following:
> >
> > From that doc:
> >
> >   It is imperative to enforce
> >   VM-IOASID ownership such that a malicious guest cannot target DMA
> >   traffic outside its own IOASIDs, or free an active IOASID that belongs
> >   to another VM.
> >
> > Huh?
> >
> > Security in a PASID world comes from the IOMMU blocking access to the
> > PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> > then that guest can cause the device to issue any PASID by having
> > complete control and the vIOMMU is supposed to tell the real IOMMU
> > what PASID's the device is allowed to access.
> >
> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll.
> >
> > Here the mdev driver would have to consult with the vIOMMU to ensure
> > the mdev device is allowed to access the PASID - is that what this
> > set stuff is about?
> >
> > If yes, it is backwards. The MDEV is the thing doing the security, the
> > MDEV should have the list of allowed PASID's and a single PASID
> > created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> > can use PASID xyz from FD abc' command.
> >
> 
> The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
> in the set of that VM are considered legitimate on this mdev. The mdev
> driver will mediate guest operations which program PASID to the backend
> context and load the PASID only if it is within the 'set' (i.e. already
> allocated through /dev/ioasid). This prevents a malicious VM from attacking
> others. Though it's not the mdev which directly maintains the list of allowed
> PASID's, the effect is the same in concept.
> 

One correction. The mdev should still construct the list of allowed PASID's as
you said (by listening to IOASID_BIND/UNBIND events), in addition to the ioasid
set maintained per VM (updated when a PASID is allocated/freed). The per-VM
set is required for inter-VM isolation (verified when a pgtable is bound to the
mdev/PASID), while the mdev's own list is necessary for intra-VM isolation when
multiple mdevs are assigned to the same VM (verified before loading a PASID
to the mdev). This series just handles the general part, i.e. the per-VM ioasid
set, and leaves the mdev's own list to be managed by the specific mdev driver,
which listens to various IOASID events.
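
A minimal sketch of the two checks described above, in plain userspace C.
All names here (vm_ioasid_set, mdev_pasid_list, vm_set_alloc, mdev_bind_pasid,
mdev_may_load_pasid) are hypothetical and are not APIs from this series; the
fixed-size arrays only stand in for whatever the real data structures are.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDS 8	/* tiny capacity, for illustration only */

/* Hypothetical per-VM set: every PASID allocated on behalf of one VM. */
struct vm_ioasid_set {
	unsigned int ids[MAX_IDS];
	size_t count;
};

/* Hypothetical per-mdev list: PASIDs that were bound on this mdev. */
struct mdev_pasid_list {
	const struct vm_ioasid_set *owner;	/* VM the mdev is assigned to */
	unsigned int ids[MAX_IDS];
	size_t count;
};

bool id_in(const unsigned int *ids, size_t count, unsigned int id)
{
	for (size_t i = 0; i < count; i++)
		if (ids[i] == id)
			return true;
	return false;
}

/* Allocation path: record a PASID in the VM's set. */
bool vm_set_alloc(struct vm_ioasid_set *set, unsigned int pasid)
{
	if (set->count == MAX_IDS || id_in(set->ids, set->count, pasid))
		return false;
	set->ids[set->count++] = pasid;
	return true;
}

/* Bind path, inter-VM check: the PASID must belong to the mdev's own VM. */
bool mdev_bind_pasid(struct mdev_pasid_list *mdev, unsigned int pasid)
{
	assert(mdev->owner != NULL);
	if (!id_in(mdev->owner->ids, mdev->owner->count, pasid))
		return false;
	if (id_in(mdev->ids, mdev->count, pasid))
		return true;	/* already bound on this mdev */
	if (mdev->count == MAX_IDS)
		return false;
	mdev->ids[mdev->count++] = pasid;
	return true;
}

/* Load path, intra-VM check: the PASID must have been bound on this mdev. */
bool mdev_may_load_pasid(const struct mdev_pasid_list *mdev, unsigned int pasid)
{
	return id_in(mdev->ids, mdev->count, pasid);
}
```

With two mdevs assigned to the same VM, a PASID bound on the first one passes
the per-VM check on both, but only the first mdev's own list lets it be loaded.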

Thanks
Kevin


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-26  8:06                             ` Jean-Philippe Brucker
@ 2021-03-30 13:07                               ` Jason Gunthorpe
  2021-03-30 13:42                                 ` Jean-Philippe Brucker
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:07 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:

> It's not inconceivable to have a control queue doing DMA tagged with
> PASID. The devices I know either use untagged DMA, or have a choice to use
> a PASID.

I don't think we should encourage that. A PASID and all the related is
so expensive compared to just doing normal untagged kernel DMA.

I assume HW has these features because virtualization use cases might
use them, eg by using mdev to assign a command queue - then it would
need to be contained by a PASID.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30  2:24                     ` Tian, Kevin
@ 2021-03-30 13:24                       ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave

On Tue, Mar 30, 2021 at 02:24:09AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> > > the use case is as the following:
> > 
> > From that doc:
> > 
> >   It is imperative to enforce
> >   VM-IOASID ownership such that a malicious guest cannot target DMA
> >   traffic outside its own IOASIDs, or free an active IOASID that belongs
> >   to another VM.
> > 
> > Huh?
> > 
> > Security in a PASID world comes from the IOMMU blocking access to the
> > PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> > then that guest can cause the device to issue any PASID by having
> > complete control and the vIOMMU is supposed to tell the real IOMMU
> > what PASID's the device is allowed to access.
> > 
> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll.
> > 
> > Here the mdev driver would have to consult with the vIOMMU to ensure
> > the mdev device is allowed to access the PASID - is that what this
> > set stuff is about?
> > 
> > If yes, it is backwards. The MDEV is the thing doing the security, the
> > MDEV should have the list of allowed PASID's and a single PASID
> > created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> > can use PASID xyz from FD abc' command.
> > 
> 
> The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
> in the set of that VM are considered legitimate on this mdev.

No! That is a major security problem!

PASID authorization is *PER DEVICE*.

If I map a device into VFIO in userspace with full control over the HW
that device MUST ONLY have access to PASID's that have been registered
with vfio.

This means each time you register a PASID vfio must tell the IOMMU
driver to authorize the pci_device to access the PASID, the vIOMMU
driver must tell the hypervisor and the mdev under the PCI device MUST
have a per-device list of allowed PASIDs.

Otherwise userspace in a VM with vfio could tell the mdev driver to
talk to a PASID in the same VM but *that process doesn't own*. This is
absolutely not allowed.

Most likely the entire ioasid set and related need to be deleted as a
kernel concept.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30  4:14                     ` Tian, Kevin
@ 2021-03-30 13:27                       ` Jason Gunthorpe
  2021-03-31  7:41                         ` Liu, Yi L
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave

On Tue, Mar 30, 2021 at 04:14:58AM +0000, Tian, Kevin wrote:

> One correction. The mdev should still construct the list of allowed PASID's as
> you said (by listening to IOASID_BIND/UNBIND events), in addition to the ioasid
> set maintained per VM (updated when a PASID is allocated/freed). The per-VM
> set is required for inter-VM isolation (verified when a pgtable is bound to the
> mdev/PASID), while the mdev's own list is necessary for intra-VM isolation when
> multiple mdevs are assigned to the same VM (verified before loading a PASID
> to the mdev). This series just handles the general part, i.e. the per-VM ioasid
> set, and leaves the mdev's own list to be managed by the specific mdev driver,
> which listens to various IOASID events.

This is better, but I don't understand why we need such a convoluted
design.

Get rid of the ioasid set.

Each driver has its own list of allowed ioasids.

Register an ioasid in the driver's list by passing the fd and ioasid #.

No listening to events. A simple understandable security model.
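
As a rough illustration of that model, assuming hypothetical names throughout
(ioasid_fd_ctx, dev_allow_list, dev_allow_ioasid and friends are not existing
kernel or uAPI symbols): the state behind the /dev/ioasid FD records what was
allocated through it, and each driver keeps its own per-device allow list that
is only ever filled by an explicit (fd, ioasid) registration. A real allocator
would hand out system-wide unique IDs; here the caller picks them to keep the
sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOASID_MAX 64	/* small ID space so one 64-bit bitmap is enough */

/* Hypothetical state behind one open /dev/ioasid FD. */
struct ioasid_fd_ctx {
	uint64_t allocated;	/* bit n set: ioasid n was allocated via this FD */
};

/* Hypothetical per-device state owned by the consuming driver. */
struct dev_allow_list {
	uint64_t allowed;	/* bit n set: device may issue DMA tagged with n */
};

bool ioasid_fd_alloc(struct ioasid_fd_ctx *ctx, unsigned int ioasid)
{
	assert(ctx != NULL);
	if (ioasid >= IOASID_MAX || (ctx->allocated & (UINT64_C(1) << ioasid)))
		return false;
	ctx->allocated |= UINT64_C(1) << ioasid;
	return true;
}

/* Register: holding the FD that allocated the ioasid is the authorization. */
bool dev_allow_ioasid(struct dev_allow_list *dev,
		      const struct ioasid_fd_ctx *ctx, unsigned int ioasid)
{
	assert(dev != NULL && ctx != NULL);
	if (ioasid >= IOASID_MAX || !(ctx->allocated & (UINT64_C(1) << ioasid)))
		return false;
	dev->allowed |= UINT64_C(1) << ioasid;
	return true;
}

/* Unregister: the device loses access to the ioasid again. */
void dev_disallow_ioasid(struct dev_allow_list *dev, unsigned int ioasid)
{
	assert(dev != NULL);
	if (ioasid < IOASID_MAX)
		dev->allowed &= ~(UINT64_C(1) << ioasid);
}

/* Check made by the driver before programming the HW with an ioasid. */
bool dev_may_use_ioasid(const struct dev_allow_list *dev, unsigned int ioasid)
{
	assert(dev != NULL);
	return ioasid < IOASID_MAX && (dev->allowed & (UINT64_C(1) << ioasid));
}
```

There is no set object and no event listener in this sketch; a device that
was never told about an ioasid simply cannot use it.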

Look - it took you three emails to even correctly explain the security
model you are striving for here, it is *obviously* too complicated for
anyone to understand or successfully implement. Simplify, simplify,
simplify.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30  1:37                     ` Tian, Kevin
@ 2021-03-30 13:28                       ` Jason Gunthorpe
  2021-03-31  7:38                         ` Liu, Yi L
  2021-04-02  8:22                         ` Tian, Kevin
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave

On Tue, Mar 30, 2021 at 01:37:05AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > 
> > On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> > 
> > > > IMHO a user created PASID is either bound to a mm (current) at creation
> > > > time, or it will never be bound to a mm and its page table is under
> > > > user control via /dev/ioasid.
> > > >
> > > True for PASID used in native SVA bind. But for binding with a guest mm,
> > > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), then
> > > bound to the host IOMMU when the vIOMMU PASID cache is invalidated.
> > >
> > > Our intention is to have two separate interfaces:
> > > 1. /dev/ioasid (allocation/free only)
> > > 2. /dev/sva (handles all SVA related activities including page tables)
> > 
> > I'm not sure I understand why you'd want to have two things. Doesn't
> > that just complicate everything?
> > 
> > Manipulating the ioasid, including filling it with page tables, seems
> > an integral inseparable part of the whole interface. Why have two ?
> 
> Hi, Jason,
> 
> Actually the above is a major open question while we are refactoring vSVA uAPI toward
> this direction. We have two concerns about merging /dev/ioasid with
> /dev/sva, and would like to hear your thought whether they are valid.
> 
> First, userspace may use ioasid in a non-SVA scenario where ioasid is 
> bound to specific security context (e.g. a control vq in vDPA) instead of 
> tying to mm. In this case there is no pgtable binding initiated from user
> space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> to the intended security context through specific passthrough framework
> which manages that context.

This sounds like the exact opposite of what I'd like to see.

I do not want to see every subsystem gaining APIs to program a
PASID. All of that should be consolidated in *one place*.

I do not want to see VDPA and VFIO have two nearly identical sets of
APIs to control the PASID.

Drivers consuming a PASID, like VDPA, should consume the PASID and do
nothing more than authorize the HW to use it.

QEMU should have general code under the viommu driver that drives
/dev/ioasid to create PASID's and manage the IO mapping according to
the guest's needs.

Drivers like VDPA and VFIO should simply accept that PASID and
configure/authorize their HW to do DMA's with its tag.

> Second, ioasid is managed per process/VM while pgtable binding is a
> device-wise operation.  The userspace flow looks like below for an integral
> /dev/ioasid interface:
> 
> - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> - ioasid_fd = open(/dev/ioasid)
> - ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
> - ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to device
> - ioctl(sva_fd, USVA_GET_INFO, &sva_info);
> - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, &ioasid);
> - ioctl(sva_fd, USVA_BIND_PGTBL, &bind_data);
> - ioctl(sva_fd, USVA_FLUSH_CACHE, &inv_info);
> - ioctl(sva_fd, USVA_UNBIND_PGTBL, &unbind_data);
> - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
> - close(sva_fd)
> - close(ioasid_fd)
> 
> Our hesitation here is based on one of your earlier comments that
> you are not a fan of constructing fd's through ioctl. Are you OK with
> above flow or have a better idea of handling it?

My reaction is to squash 'sva' and ioasid fds together, I can't see
why you'd need two fds to manipulate a PASID.

DEVICE_SET_SVA seems like the wrong language too, it should be more
like DEVICE_ALLOW_IOASID which only tells the iommu and driver to allow
the pci_device to use the IOASID.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:07                               ` Jason Gunthorpe
@ 2021-03-30 13:42                                 ` Jean-Philippe Brucker
  2021-03-30 13:46                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-03-30 13:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Tue, Mar 30, 2021 at 10:07:55AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:
> 
> > It's not inconceivable to have a control queue doing DMA tagged with
> > PASID. The devices I know either use untagged DMA, or have a choice to use
> > a PASID.
> 
> I don't think we should encourage that. A PASID and all the related is
> so expensive compared to just doing normal untagged kernel DMA.

How is it expensive?  Low number of PASIDs, or slowing down DMA
transactions?  PASIDs aren't a scarce resource on Arm systems, they have
almost 1M unused PASIDs per VM.

Thanks,
Jean

> I assume HW has these features because virtualization use cases might
> use them, eg by using mdev to assign a command queue - then it would
> need to be contained by a PASID.
> 
> Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-29 22:55                     ` Jacob Pan
@ 2021-03-30 13:43                       ` Jason Gunthorpe
  2021-03-31  0:10                         ` Jacob Pan
  2021-03-31  8:38                         ` Liu, Yi L
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:43 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Mon, Mar 29, 2021 at 03:55:26PM -0700, Jacob Pan wrote:

> In one of the earlier discussions, I was made aware of some use cases (by
> AMD, iirc) where PASID can be used w/o IOMMU. That is why I tried to keep
> ioasid a separate subsystem. Other than that, I don't see an issue
> combining the two.

That sounds like nonsense. A freshly created ioasid should have *NO
DMA*. Every access to it should result in a PCI error until a mapping
for the address space is defined. It is called IO *address space* for
a reason.

So, what exactly do you do with a PASID without an IOMMU? You
certainly can't expose it through this interface because you can't
establish the first requirement of *NO DMA*.

While there may be an interesting use case, it looks to be kernel-only
and not relevant here.

> it matters when it comes to which interface to choose. Use /dev/ioasid to
> allocate if PASID value cannot be hidden. Use some other interface for bind
> current and allocate if a PASID is not visible to the user.

I just view it as a shortcut. It has less to do with "hidden" and more
to do with whether the shortcut is a valuable saving. If you swap four ioctls
for one ioctl I'd say that is not enough of a win <shrug>
 
> Yes, I think we are on the same page. For example, today's uacce or idxd
> driver creates a hidden PASID when user does open(), where a new WQ is
> provisioned and bound to current mm. This is the case where /dev/ioasid is
> not needed.

So that is a problem for uacce, they shouldn't have created PASIDs at
open() time, there is no option to customize what is happening there.

> Sorry, I am not following. In the doc, I have an example to show the
> ioasid_set to VM/mm mapping. We use mm as the ioasid_set token to identify
> who the owner of an IOASID is. i.e. who allocated the IOASID. Non-owner
> cannot perform bind page table or free operations.

As I said to Kevin this seems very over complicated.

Access to the /dev/ioasid FD is the only authorization the kernel
needs.

> Yes, each PF/VF has its own PASID table. The device can do whatever
> it wants as long as the PASID is present in the table. Programming of the
> pIOMMU PASID table entry, however, is controlled by the host.
> 
> IMHO, there are two levels of security here:
> 1. A PASID can only be used by a secure context
> 2. A device can only use allowed PASIDs (PASID namespace is system-wide but
> PASID table storage is per PF/VF)
> 
> IOASID set is designed for #1.

#1 sounds like the mdev case, and as I said to Kevin each and every
mdev needs its own allowed PASID list. There is no need for an ioasid
set to implement that.

> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll. 
> 
> If two mdevs from the same PF dev are assigned to two VMs, the PASID
> table will be shared. IOASID set ensures one VM cannot program another VM's
> PASIDs. I assume 'secure context' is per VM when it comes to host PASID.

No, the mdev device driver must enforce this directly. It is the one
that programs the physical shared HW, it is the one that needs a list
of PASID's it is allowed to program *for each mdev*

ioasid_set doesn't seem to help at all, certainly not as a concept
tied to /dev/ioasid.

> No. The mdev driver consults with the IOASID core when the guest programs a
> guest PASID onto the mdev. The VDCM driver does a lookup:
> host_pasid = ioasid_find_by_spid(ioasid_set, guest_pasid);

This is the wrong layering. Tell the mdev device directly what it is
allowed to do. Do not pollute the ioasid core with security stuff.

> > I'd say you should have a single /dev/ioasid per VM and KVM should
> > attach to that - it should get all the global events/etc that are not
> > device specific.
> > 
> You mean a single /dev/ioasid FD per VM and KVM? I think that is what we
> are doing in this set. A VM process can only open /dev/ioasid once, then
> use the FD for allocation and pass the PASID for bind page table etc.

Yes, I think that is reasonable.

Tag all the IOCTL's with the IOASID number.
 
> > Not sure what guest-host PASID means, these have to be 1:1 for device
> > assignment to work - why would use something else for mdev?
> > 
> We have G-H PASID translation. They don't have to be 1:1.
> IOASID Set Private ID (SPID) is intended as a generic solution for guest PASID.
> Could you review the section "IOASID Set Private ID (SPID)" in the
> doc patch?

Again this only works for MDEV? How would you do translation for a
real PF/VF?

So when you 'allow' a mdev to access a PASID you want to say:
 Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD

?

That seems like a good helper library to provide for drivers to use,
but it should be a construct entirely contained in the driver.
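
One way such a driver-contained helper could look, as a sketch only: a small
per-mdev table that records "guest PASID A maps to host PASID B" at allow time
and is consulted when the guest programs a PASID. The names mdev_pasid_map,
mdev_allow_pasid and mdev_translate_pasid are hypothetical, and the fixed-size
array is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MDEV_MAX_PASIDS 4	/* tiny capacity, for illustration only */

struct pasid_map_entry {
	unsigned int guest;	/* PASID value as seen by the guest */
	unsigned int host;	/* PASID value programmed into the HW */
};

/* Hypothetical per-mdev table, owned entirely by the mdev driver. */
struct mdev_pasid_map {
	struct pasid_map_entry entries[MDEV_MAX_PASIDS];
	size_t count;
};

/* "Allow guest PASID A, map it to host PASID B" for this one mdev. */
bool mdev_allow_pasid(struct mdev_pasid_map *map,
		      unsigned int guest, unsigned int host)
{
	assert(map != NULL);
	for (size_t i = 0; i < map->count; i++)
		if (map->entries[i].guest == guest)
			return false;	/* guest PASID is already mapped */
	if (map->count == MDEV_MAX_PASIDS)
		return false;
	map->entries[map->count++] = (struct pasid_map_entry){ guest, host };
	return true;
}

/*
 * Translate a guest PASID. Returns false, leaving *host untouched, when
 * the guest PASID was never allowed on this mdev.
 */
bool mdev_translate_pasid(const struct mdev_pasid_map *map,
			  unsigned int guest, unsigned int *host)
{
	assert(map != NULL && host != NULL);
	for (size_t i = 0; i < map->count; i++) {
		if (map->entries[i].guest == guest) {
			*host = map->entries[i].host;
			return true;
		}
	}
	return false;
}
```

Because the table doubles as the allow list, a failed translation is also the
refusal: the driver has nothing to program for an unknown guest PASID.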

> We also had some slides from last year. Slides 3-6 mostly.
> https://static.sched.com/hosted_files/kvmforum2020/9f/KVM_forum_2020_PASID_MGMT_Yi_Jacob_final.pdf

I think you are trying to put too much into a giant ioasid
core. Responsibility needs to rest in more logical places, it will
simplify everything.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:42                                 ` Jean-Philippe Brucker
@ 2021-03-30 13:46                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:46 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Alex Williamson, Eric Auger,
	Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu, Wu Hao,
	Dave Jiang

On Tue, Mar 30, 2021 at 03:42:24PM +0200, Jean-Philippe Brucker wrote:
> On Tue, Mar 30, 2021 at 10:07:55AM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:
> > 
> > > It's not inconceivable to have a control queue doing DMA tagged with
> > > PASID. The devices I know either use untagged DMA, or have a choice to use
> > > a PASID.
> > 
> > I don't think we should encourage that. A PASID and all the related is
> > so expensive compared to just doing normal untagged kernel DMA.
> 
> How is it expensive?  Low number of PASIDs, or slowing down DMA
> transactions?  PASIDs aren't a scarce resource on Arm systems, they have
> almost 1M unused PASIDs per VM.

There may be lots of PASIDs, but they are not without cost. The page
table behind them costs memory and cache occupancy, and doing the lookups
hurts DMA performance.

Compared to physically addressed kernel DMA (like x86 often sets up),
the runtime overhead from unnecessary PASID use is quite big.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:43                       ` Jason Gunthorpe
@ 2021-03-31  0:10                         ` Jacob Pan
  2021-03-31 12:28                           ` Jason Gunthorpe
  2021-03-31  8:38                         ` Liu, Yi L
  1 sibling, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-31  0:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Tue, 30 Mar 2021 10:43:13 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> > If two mdevs from the same PF dev are assigned to two VMs, the PASID
> > table will be shared. IOASID set ensures one VM cannot program another
> > VM's PASIDs. I assume 'secure context' is per VM when it comes to host
> > PASID.  
> 
> No, the mdev device driver must enforce this directly. It is the one
> that programs the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
This requires the mdev driver to obtain a list of allowed PASIDs (possibly
at PASID bind time) before doing enforcement. IMHO, the PASID enforcement
points are:
1. During WQ configuration (e.g. program MSI)
2. During work submission

For a VT-d shared workqueue, there is no way to enforce #2 in the mdev driver,
because the PASID is obtained from the CPU's PASID MSR and submitted w/o
driver involvement. The enforcement for #2 is in the KVM PASID translation
table, which is per VM.

For our current VFIO mdev model, binding a guest page table does not involve
the mdev driver. So this is a gap we must fill, i.e. include a callback from
the mdev driver?

> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 
Yes, we can take the security role off ioasid_set once we have a per-mdev
list. However, ioasid_set, being a per-VM/mm entity, also bridges
communication among kernel subsystems that don't have a direct call path,
e.g. KVM, VDCM and IOMMU.

> > No. The mdev driver consults with the IOASID core when the guest programs a
> > guest PASID onto the mdev. The VDCM driver does a lookup:
> > host_pasid = ioasid_find_by_spid(ioasid_set, guest_pasid);  
> 
> This is the wrong layering. Tell the mdev device directly what it is
> allowed to do. Do not pollute the ioasid core with security stuff.
> 
> > > I'd say you should have a single /dev/ioasid per VM and KVM should
> > > attach to that - it should get all the global events/etc that are not
> > > device specific.
> > >   
> > You mean a single /dev/ioasid FD per VM and KVM? I think that is what we
> > are doing in this set. A VM process can only open /dev/ioasid once, then
> > use the FD for allocation and pass the PASID for bind page table etc.  
> 
> Yes, I think that is reasonable.
> 
> Tag all the IOCTL's with the IOASID number.
>  
> > > Not sure what guest-host PASID means, these have to be 1:1 for device
> > > assignment to work - why would use something else for mdev?
> > >   
> > We have G-H PASID translation. They don't have to be 1:1.
> > IOASID Set Private ID (SPID) is intended as a generic solution for
> > guest PASID. Could you review the section "IOASID Set Private ID
> > (SPID)" in the doc patch?
> 
> Again this only works for MDEV? How would you do translation for a
> real PF/VF?
> 
Right, we will need some mediation for PF/VF.

> So when you 'allow' a mdev to access a PASID you want to say:
>  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> 
> ?
> 
Host and guest PASID values, as well as device info, are available through
iommu_uapi_sva_bind_gpasid(); we just need to feed that info to the mdev driver.

> That seems like a good helper library to provide for drivers to use,
> but it should be a construct entirely contained in the driver.
Why? Would it be cleaner if it were in the common code?

Thanks,

Jacob


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:28                       ` Jason Gunthorpe
@ 2021-03-31  7:38                         ` Liu, Yi L
  2021-03-31 12:40                           ` Jason Gunthorpe
  2021-04-02  8:22                         ` Tian, Kevin
  1 sibling, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-03-31  7:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

Hi Jason,

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> On Tue, Mar 30, 2021 at 01:37:05AM +0000, Tian, Kevin wrote:
[...]
> > Hi, Jason,
> >
> > Actually the above is a major open question while we are refactoring vSVA uAPI toward
> > this direction. We have two concerns about merging /dev/ioasid with
> > /dev/sva, and would like to hear your thought whether they are valid.
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> QEMU should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 
> > Second, ioasid is managed per process/VM while pgtable binding is a
> > device-wise operation.  The userspace flow looks like below for an integral
> > /dev/ioasid interface:
> >
> > - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> > - ioasid_fd = open(/dev/ioasid)
> > - ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
> > - ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to device
> > - ioctl(sva_fd, USVA_GET_INFO, &sva_info);
> > - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, &ioasid);
> > - ioctl(sva_fd, USVA_BIND_PGTBL, &bind_data);
> > - ioctl(sva_fd, USVA_FLUSH_CACHE, &inv_info);
> > - ioctl(sva_fd, USVA_UNBIND_PGTBL, &unbind_data);
> > - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
> > - close(sva_fd)
> > - close(ioasid_fd)
> >
> > Our hesitation here is based on one of your earlier comments that
> > you are not a fan of constructing fd's through ioctl. Are you OK with
> > above flow or have a better idea of handling it?
> 
> My reaction is to squash 'sva' and ioasid fds together, I can't see
> why you'd need two fds to manipulate a PASID.

The reason is that the /dev/ioasid FD is per-VM, since an ioasid allocated to
the VM should be shareable by all devices assigned to that VM, whereas
the SVA operations (bind/unbind page table, cache_invalidate) should
be per-device. If the two fds are squashed into one, every vSVA ioctl
needs a device tag, and I'm not sure that is good. To me, it looks
better to have an SVA FD associated with a device FD, so that any ioctl
on it operates at the device level. This also benefits ARM's and AMD's vSVA
support, since they bind the guest PASID table to the host instead of binding
guest page tables to specific PASIDs.

Regards,
Yi Liu


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:27                       ` Jason Gunthorpe
@ 2021-03-31  7:41                         ` Liu, Yi L
  2021-03-31 12:38                           ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-03-31  7:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 30, 2021 9:28 PM
> 
> On Tue, Mar 30, 2021 at 04:14:58AM +0000, Tian, Kevin wrote:
> 
> > One correction. The mdev should still construct the list of allowed
> > PASID's as you said (by listening to IOASID_BIND/UNBIND event), in
> > addition to the ioasid set maintained per VM (updated when a PASID is
> > allocated/freed). The per-VM set is required for inter-VM isolation
> > (verified when a pgtable is bound to the mdev/PASID), while the mdev's
> > own list is necessary for intra-VM isolation when multiple mdevs are
> > assigned to the same VM (verified before loading a PASID to the mdev).
> > This series just handles the general part, i.e. the per-VM ioasid set,
> > and leaves the mdev's own list to be managed by the specific mdev
> > driver, which listens to various IOASID events.
> 
> This is better, but I don't understand why we need such a convoluted
> design.
> 
> Get rid of the ioasid set.
>
> Each driver has its own list of allowed ioasids.

First, I agree with you that it's necessary to have a per-device allowed
ioasid list. But besides that, I think we still need to ensure the ioasid
used by a VM was really allocated to this VM. A VM should not use an ioasid
allocated to another VM, right? Actually, this is the main motivation for
introducing ioasid_set.

> Register a ioasid in the driver's list by passing the fd and ioasid #

The fd here is a device fd. Am I right? If yes, your idea is that the ioasid
is allocated via /dev/ioasid and associated with the device fd via either a
VFIO or a vDPA ioctl, right? Sorry, I may be asking silly questions, but I
really need to ensure we are on the same page.

> No listening to events. A simple understandable security model.

For this suggestion, I am a little concerned that we may have an A-B/B-A
lock ordering issue, since it requires /dev/ioasid (if it supports this)
to call back into VFIO/VDPA to check whether the ioasid has been registered
to the device FD and to record it in the per-device list, right? Let's have
more discussion based on the skeleton sent by Kevin.

Regards,
Yi Liu


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:43                       ` Jason Gunthorpe
  2021-03-31  0:10                         ` Jacob Pan
@ 2021-03-31  8:38                         ` Liu, Yi L
  1 sibling, 0 replies; 269+ messages in thread
From: Liu, Yi L @ 2021-03-31  8:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Tian, Kevin, Wu, Hao,
	Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 30, 2021 9:43 PM
[..]
> No, the mdev device driver must enforce this directly. It is the one
> that programs the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 

As I replied in another thread, we introduced ioasid_set to have per-VM
ioasid tracking, which is required when user space tries to bind an ioasid
to a device. We should ensure the ioasid it is using was allocated to it;
otherwise, we may suffer from inter-VM ioasid problems. It may not
necessarily be ioasid_set, but some per-VM ioasid tracking is necessary.

Regards,
Yi Liu


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31  0:10                         ` Jacob Pan
@ 2021-03-31 12:28                           ` Jason Gunthorpe
  2021-03-31 16:34                             ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 12:28 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Tue, Mar 30, 2021 at 05:10:41PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Tue, 30 Mar 2021 10:43:13 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > > If two mdevs from the same PF dev are assigned to two VMs, the PASID
> > > table will be shared. IOASID set ensures one VM cannot program another
> > > VM's PASIDs. I assume 'secure context' is per VM when it comes to host
> > > PASID.  
> > 
> > No, the mdev device driver must enforce this directly. It is the one
> > that programs the physical shared HW, it is the one that needs a list
> > of PASID's it is allowed to program *for each mdev*
> > 
> This requires the mdev driver to obtain a list of allowed PASIDs (possibly
> at PASID bind time) before it can do enforcement. IMHO, the PASID
> enforcement points are:
> 1. During WQ configuration (e.g. program MSI)
> 2. During work submission
> 
> For VT-d shared workqueue, there is no way to enforce #2 in the mdev driver,
> because the PASID is obtained from the PASID MSR on the CPU and submitted
> w/o driver involvement.

I assume that the PASID MSR is privileged and only qemu can program
it? Otherwise this seems like a security problem.

If qemu controls it then the idxd userspace driver in qemu must ensure
it is only ever programmed to an authorized PASID.

> The enforcement for #2 is in the KVM PASID translation table, which
> is per VM.

I don't understand why KVM gets involved in PASID??

Doesn't work submission go either to the mdev driver or through the
secure PASID of #1?

> For our current VFIO mdev model, bind guest page table does not involve
> mdev driver. So this is a gap we must fill, i.e. include a callback from
> mdev driver?

No, not a callback: tell the mdev driver with a VFIO IOCTL that it is
authorized to use a specific PASID because the vIOMMU was told to
allow it by the guest kernel. Simple and straightforward.

> > ioasid_set doesn't seem to help at all, certainly not as a concept
> > tied to /dev/ioasid.
> > 
> Yes, we can take the security role off ioasid_set once we have per mdev
> list. However, ioasid_set being a per VM/mm entity also bridge
> communications among kernel subsystems that don't have direct call path.
> e.g. KVM, VDCM and IOMMU.

Everything should revolve around the /dev/ioasid FD. qemu should pass
it to all places that need to know about PASID's in the VM.

We should try to avoid hidden behind the scenes kernel
interconnections between subsystems.


> > So when you 'allow' a mdev to access a PASID you want to say:
> >  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> > 

> Host and guest PASID value, as well as device info are available through
> iommu_uapi_sva_bind_gpasid(), we just need to feed that info to mdev driver.

You need that IOCTL to exist on the *mdev driver*. It is a VFIO ioctl,
not a iommu or ioasid or sva IOCTL.
 
> > That seems like a good helper library to provide for drivers to use,
> > but it should be a construct entirely contained in the driver.
> why? would it be cleaner if it is in the common code?

No, it is the "mid layer" problematic design.

Having the iommu layer store driver-specific data on behalf of a
driver will just make a mess. Use the natural layering we have and
store driver specific data in the driver structs.

Add a library to help build the datastructure if it is necessary.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31  7:41                         ` Liu, Yi L
@ 2021-03-31 12:38                           ` Jason Gunthorpe
  2021-03-31 23:46                             ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 12:38 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Wed, Mar 31, 2021 at 07:41:40AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 30, 2021 9:28 PM
> > 
> > On Tue, Mar 30, 2021 at 04:14:58AM +0000, Tian, Kevin wrote:
> > 
> > > One correction. The mdev should still construct the list of allowed
> > > PASID's as you said (by listening to IOASID_BIND/UNBIND event), in
> > > addition to the ioasid set maintained per VM (updated when a PASID is
> > > allocated/freed). The per-VM set is required for inter-VM isolation
> > > (verified when a pgtable is bound to the mdev/PASID), while the mdev's
> > > own list is necessary for intra-VM isolation when multiple mdevs are
> > > assigned to the same VM (verified before loading a PASID to the mdev).
> > > This series just handles the general part, i.e. the per-VM ioasid set,
> > > and leaves the mdev's own list to be managed by the specific mdev
> > > driver, which listens to various IOASID events.
> > 
> > This is better, but I don't understand why we need such a convoluted
> > design.
> > 
> > Get rid of the ioasid set.
> >
> > Each driver has its own list of allowed ioasids.
> 
> First, I agree with you that it's necessary to have a per-device allowed
> ioasid list. But besides that, I think we still need to ensure the ioasid
> used by a VM was really allocated to this VM. A VM should not use an ioasid
> allocated to another VM, right? Actually, this is the main motivation for
> introducing ioasid_set.

The /dev/ioasid FD replaces this security check. By becoming FD
centric you don't need additional kernel security objects.

Any process with access to the /dev/ioasid FD is allowed to control
those PASIDs. The separation between VMs falls naturally from the
separation of FDs without creating additional, complicated security
infrastructure in the kernel.

This is why all APIs must be FD focused, and you need to have a
logical layering of responsibility.

 Allocate a /dev/ioasid FD
 Allocate PASIDs inside the FD
 Assign memory to the PASIDs

 Open a device FD, e.g. from VFIO or vDPA
 Instruct the device FD to authorize the device to access PASID A in
 an ioasid FD
   * Prior to being authorized the device will have NO access to any
     PASID
   * Presenting BOTH the device FD and the ioasid FD to the kernel
     is the security check. Any process with both FDs is allowed to
     make the connection. This is normal Unix FD centric thinking.
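
To make the layering above concrete, here is a minimal userspace C model of
the check, offered only as a sketch. The names (struct ioasid_ctx,
struct dev_ctx, ioasid_alloc, dev_authorize, dev_may_use) are hypothetical
and are not existing kernel or uAPI symbols. The caller picks the PASID
value here purely to keep the model small. The point it illustrates is that
a device starts with no PASID access, and authorization only succeeds when
the caller presents both the device context and the ioasid context that
owns the PASID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PASIDS 8

/* Hypothetical stand-in for the state behind one open /dev/ioasid FD. */
struct ioasid_ctx {
	unsigned int pasids[MAX_PASIDS];
	size_t count;
};

/* Hypothetical stand-in for the state behind one device FD. */
struct dev_ctx {
	unsigned int allowed[MAX_PASIDS];
	size_t count;
};

/* True if this ioasid context owns the given PASID. */
bool ctx_owns(const struct ioasid_ctx *ctx, unsigned int pasid)
{
	for (size_t i = 0; i < ctx->count; i++)
		if (ctx->pasids[i] == pasid)
			return true;
	return false;
}

/* "Allocate PASIDs inside the FD": record the PASID as owned by ctx. */
bool ioasid_alloc(struct ioasid_ctx *ctx, unsigned int pasid)
{
	if (ctx->count == MAX_PASIDS || ctx_owns(ctx, pasid))
		return false;
	ctx->pasids[ctx->count++] = pasid;
	return true;
}

/* Later check, e.g. before the driver programs the PASID into HW. */
bool dev_may_use(const struct dev_ctx *dev, unsigned int pasid)
{
	for (size_t i = 0; i < dev->count; i++)
		if (dev->allowed[i] == pasid)
			return true;
	return false;
}

/*
 * Authorize a device for a PASID. Holding both contexts is the security
 * check: the PASID must belong to the ioasid context that is presented.
 */
bool dev_authorize(struct dev_ctx *dev, const struct ioasid_ctx *ctx,
		   unsigned int pasid)
{
	assert(dev && ctx);
	if (!ctx_owns(ctx, pasid))
		return false;	/* PASID belongs to someone else */
	if (dev_may_use(dev, pasid))
		return true;	/* already authorized */
	if (dev->count == MAX_PASIDS)
		return false;
	dev->allowed[dev->count++] = pasid;
	return true;
}
```

In this model a second ioasid context (another VM) cannot get its PASID
accepted through the first VM's context, which is the inter-VM isolation
that ioasid_set was meant to provide.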

> > Register a ioasid in the driver's list by passing the fd and ioasid #
> 
> The fd here is a device fd. Am I right? 

It would be the vfio_device FD, for instance, and a VFIO IOCTL.

> If yes, your idea is that the ioasid is allocated via /dev/ioasid and
> associated with the device fd via either a VFIO or a vDPA ioctl, right?
> Sorry, I may be asking silly questions, but I really need to ensure we
> are on the same page.

Yes, this is right

> > No listening to events. A simple understandable security model.
> 
> For this suggestion, I am a little concerned that we may have an A-B/B-A
> lock ordering issue, since it requires /dev/ioasid (if it supports this)
> to call back into VFIO/VDPA to check whether the ioasid has been registered
> to the device FD and to record it in the per-device list, right? Let's have
> more discussion based on the skeleton sent by Kevin.

Callbacks would be backwards.

User calls VFIO with the vfio_device fd and the /dev/ioasid fd

VFIO extracts some kernel representation of the ioasid from the ioasid
fd using an API

VFIO does some kernel call to the IOMMU/IOASID layer that says 'tell the
IOMMU that this PCI device is allowed to use this PASID'

The VFIO mdev driver then records that the PASID is allowed in its own
device specific struct for later checking during other system calls.

No lock inversions. No callbacks. Why do we need callbacks?? Simplify.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31  7:38                         ` Liu, Yi L
@ 2021-03-31 12:40                           ` Jason Gunthorpe
  2021-04-01  4:38                             ` Liu, Yi L
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 12:40 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:

> The reason is that the /dev/ioasid FD is per-VM, since an ioasid allocated
> to the VM should be shareable by all devices assigned to that VM.
> But the SVA operations (bind/unbind page table, cache_invalidate) should
> be per-device.

It is not *per-device* it is *per-ioasid*

And as /dev/ioasid is an interface for controlling multiple ioasid's
there is no issue to also multiplex the page table manipulation for
multiple ioasids as well.

What you should do next is sketch out in some RFC the exact ioctls
each FD would have and show how the parts I outlined would work and
point out any remaining gaps.

The device FD is something like the vfio_device FD from VFIO, it has
*nothing* to do with PASID beyond having a single ioctl to authorize
the device to use the PASID. All control of the PASID is in
/dev/ioasid.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 12:28                           ` Jason Gunthorpe
@ 2021-03-31 16:34                             ` Jacob Pan
  2021-03-31 17:31                               ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-31 16:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Wed, 31 Mar 2021 09:28:05 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 30, 2021 at 05:10:41PM -0700, Jacob Pan wrote:
>  [...]  
>  [...]  
>  [...]  
> > This requires the mdev driver to obtain a list of allowed
> > PASIDs (possibly at PASID bind time) before it can do enforcement. IMHO,
> > the PASID enforcement points are:
> > 1. During WQ configuration (e.g. program MSI)
> > 2. During work submission
> > 
> > For VT-d shared workqueue, there is no way to enforce #2 in the mdev
> > driver, because the PASID is obtained from the PASID MSR on the CPU and
> > submitted w/o driver involvement.
> 
> I assume that the PASID MSR is privileged and only qemu can program
> it? Otherwise this seems like a security problem.
> 
yes.

> If qemu controls it then the idxd userspace driver in qemu must ensure
> it is only ever programmed to an authorized PASID.
> 
it is ensured for #1.

> > The enforcement for #2 is in the KVM PASID translation table, which
> > is per VM.  
> 
> I don't understand why KVM gets involved in PASID??
> 
Here is an excerpt from the SIOV spec.
https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html

"3.3 PASID translation
To support PASID isolation for Shared Work Queues used by VMs, the CPU must
provide a way for the PASID to be communicated to the device in the DMWr
transaction. On Intel CPUs, the CPU provides a PASID translation table in
the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
instruction execution in a VM, the PASID translation table is used by the
CPU to replace the guest PASID in the work descriptor with a host PASID
before the descriptor is sent to the device."

> Doesn't work submission go either to the mdev driver or through the
> secure PASID of #1?
> 
No, once a PASID is bound with IOMMU, KVM, and the mdev, work submission is
all done in HW.
But I don't think this will change for either uAPI design.

> > For our current VFIO mdev model, bind guest page table does not involve
> > mdev driver. So this is a gap we must fill, i.e. include a callback from
> > mdev driver?  
> 
> No not a callback, tell the mdev driver with a VFIO IOCTL that it is
> authorized to use a specific PASID because the vIOMMU was told to
> allow it by the guest kernel. Simple and straightforward.
> 
Makes sense.

> > > ioasid_set doesn't seem to help at all, certainly not as a concept
> > > tied to /dev/ioasid.
> > >   
> > Yes, we can take the security role off ioasid_set once we have per mdev
> > list. However, ioasid_set being a per VM/mm entity also bridge
> > communications among kernel subsystems that don't have direct call path.
> > e.g. KVM, VDCM and IOMMU.  
> 
> Everything should revolve around the /dev/ioasid FD. qemu should pass
> it to all places that need to know about PASID's in the VM.
> 
I guess we need to extend KVM interface to support PASIDs. Our original
intention was to avoid introducing new interfaces.

> We should try to avoid hidden behind the scenes kernel
> interconnections between subsystems.
> 
Can we, in the exception cases? Since all these IOCTLs come from
unreliable user space, we must deal with all exceptions.

For example, when the user closes the /dev/ioasid FD before (or w/o) the
unbind IOCTL for VFIO and KVM, the kernel must do cleanup and coordinate
among subsystems. In this patchset, we have a per-mm (ioasid_set) notifier
to inform mdev and KVM to clean up and drop their refcounts. Do you have
any suggestion on this?

> 
> > > So when you 'allow' a mdev to access a PASID you want to say:
> > >  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> > >   
> 
> > Host and guest PASID value, as well as device info are available through
> > iommu_uapi_sva_bind_gpasid(), we just need to feed that info to mdev
> > driver.  
> 
> You need that IOCTL to exist on the *mdev driver*. It is a VFIO ioctl,
> not a iommu or ioasid or sva IOCTL.
>
OK. A separate IOCTL and separate step.

> > > That seems like a good helper library to provide for drivers to use,
> > > but it should be a construct entirely contained in the driver.  
> > why? would it be cleaner if it is in the common code?  
> 
> No, it is the "mid layer" problematic design.
> 
> Having the iommu layer store driver-specific data on behalf of a
> driver will just make a mess. Use the natural layering we have and
> store driver specific data in the driver structs.
> 
> Add a library to help build the datastructure if it is necessary.
> 
Let me try to paraphrase, you are suggesting common helper code and data
format but still driver specific storage of the mapping, correct?

Will try this out, seems cleaner.

> Jason


Thanks,

Jacob


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 16:34                             ` Jacob Pan
@ 2021-03-31 17:31                               ` Jason Gunthorpe
  2021-03-31 18:20                                 ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 17:31 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Wed, Mar 31, 2021 at 09:34:57AM -0700, Jacob Pan wrote:

> "3.3 PASID translation
> To support PASID isolation for Shared Work Queues used by VMs, the CPU must
> provide a way for the PASID to be communicated to the device in the DMWr
> transaction. On Intel CPUs, the CPU provides a PASID translation table in
> the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
> instruction execution in a VM, the PASID translation table is used by the
> CPU to replace the guest PASID in the work descriptor with a host PASID
> before the descriptor is sent to the device."

Yikes, a special ENQCMD table in the hypervisor!

Still, pass the /dev/ioasid into a KVM IOCTL and tell it to populate
this table. KVM only adds to the table when userspace presents a
/dev/ioasid FD.

> > Doesn't work submission go either to the mdev driver or through the
> > secure PASID of #1?
> 
> No, once a PASID is bound with IOMMU, KVM, and the mdev, work
> submission is all done in HW.  But I don't think this will change
> for either uAPI design.

The big note here is "only for things that use ENQCMD" and that is
basically nothing these days.

> > Everything should revolve around the /dev/ioasid FD. qemu should pass
> > it to all places that need to know about PASID's in the VM.
> 
> I guess we need to extend KVM interface to support PASIDs. Our original
> intention was to avoid introducing new interfaces.

New features need new interfaces, especially if there is a security
sensitivity! KVM should *not* automatically opt into security
sensitive stuff without being explicitly told what to do.

Here you'd need to authorize *two* things for IDXD:
 - The mdev needs to be told it is allowed to use PASID, this tells
   the IOMMU driver to connect the pci device under the mdev
 - KVM needs to be told to populate a vPASID to the 'ENQCMD'
   security table translated to a physical PASID.

If qemu doesn't explicitly enable the ENQCMD security table it should
be *left disabled* by KVM - even if someone else is using PASID in the
same process. And the API should be narrow like this, just to the
ENQCMD table, as who knows what will come down the road, or how it will
work.

Having a PASID wrongly leak out into the VM would be a security
disaster. Be explicit.
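
As a rough illustration of "left disabled unless explicitly enabled", the
following userspace C sketch models a guest-to-host PASID translation
table. It is an assumption-laden toy, not KVM code: struct pasid_xlate,
xlate_enable, xlate_add, xlate_lookup and PASID_INVALID are hypothetical
names, and the real table lives in CPU virtualization structures. The
model only shows the policy: nothing is translated until userspace opts
in, and only explicitly added entries resolve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define XLATE_SLOTS   4
#define PASID_INVALID 0xffffffffu

/* Hypothetical per-VM guest PASID -> host PASID table. */
struct pasid_xlate {
	bool enabled;		/* false until userspace opts in */
	struct {
		unsigned int gpasid;
		unsigned int hpasid;
		bool valid;
	} slot[XLATE_SLOTS];
};

/* Explicit opt-in step; without it every add and lookup fails. */
void xlate_enable(struct pasid_xlate *t)
{
	assert(t);
	t->enabled = true;
}

/* Returns the host PASID for a guest PASID, or PASID_INVALID. */
unsigned int xlate_lookup(const struct pasid_xlate *t, unsigned int gpasid)
{
	assert(t);
	if (!t->enabled)
		return PASID_INVALID;
	for (size_t i = 0; i < XLATE_SLOTS; i++)
		if (t->slot[i].valid && t->slot[i].gpasid == gpasid)
			return t->slot[i].hpasid;
	return PASID_INVALID;
}

/* Add one mapping; refused while disabled, on duplicates, or when full. */
bool xlate_add(struct pasid_xlate *t, unsigned int gpasid,
	       unsigned int hpasid)
{
	assert(t);
	if (!t->enabled || xlate_lookup(t, gpasid) != PASID_INVALID)
		return false;
	for (size_t i = 0; i < XLATE_SLOTS; i++) {
		if (!t->slot[i].valid) {
			t->slot[i].gpasid = gpasid;
			t->slot[i].hpasid = hpasid;
			t->slot[i].valid = true;
			return true;
		}
	}
	return false;
}
```

A zero-initialized table therefore leaks nothing into the guest, even if
other parts of the same process already use PASIDs.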

> > We should try to avoid hidden behind the scenes kernel
> > interconnections between subsystems.
> > 
> Can we, in the exception cases? Since all these IOCTLs come from
> unreliable user space, we must deal with all exceptions.
>
> For example, when the user closes the /dev/ioasid FD before (or w/o) the
> unbind IOCTL for VFIO and KVM, the kernel must do cleanup and coordinate
> among subsystems. In this patchset, we have a per-mm (ioasid_set) notifier
> to inform mdev and KVM to clean up and drop their refcounts. Do you have
> any suggestion on this?

The ioasid should be a reference counted object.

When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
and parks the PASID until *all* places release it. Upon a zero
refcount the PASID is recycled for future use.
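
The lifecycle described here can be sketched in a few lines of userspace
C. This is only a model under stated assumptions: struct ioasid_obj and
the ioasid_init/get/put/destroy helpers are hypothetical names, a plain
counter stands in for a real kernel refcount, and there is no locking.
Destroying the ioasid blocks DMA and drops the owner's reference, the ID
stays parked while other users still hold references, and it becomes
reusable only when the count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one ioasid and its lifetime state. */
struct ioasid_obj {
	unsigned int refs;	/* one per party still holding the PASID */
	bool dma_blocked;	/* set when the owner destroys the ioasid */
	bool recycled;		/* set when the ID may be handed out again */
};

/* The creator (the /dev/ioasid FD in the discussion) holds the first ref. */
void ioasid_init(struct ioasid_obj *o)
{
	o->refs = 1;
	o->dma_blocked = false;
	o->recycled = false;
}

/* Another user (e.g. an mdev driver or KVM) takes a reference. */
void ioasid_get(struct ioasid_obj *o)
{
	assert(o->refs > 0);
	o->refs++;
}

/* Drop a reference; the last put recycles the ID. */
void ioasid_put(struct ioasid_obj *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->recycled = true;
}

/*
 * FD close or explicit destroy: block DMA right away and drop the owner's
 * reference. No notification is sent; other users release when they reach
 * a natural point in their own teardown.
 */
void ioasid_destroy(struct ioasid_obj *o)
{
	o->dma_blocked = true;
	ioasid_put(o);
}

bool ioasid_dma_allowed(const struct ioasid_obj *o)
{
	return !o->dma_blocked && !o->recycled;
}
```

Between ioasid_destroy() and the final ioasid_put() the object is in the
"parked" state: DMA is refused, but the ID cannot be reallocated.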

The duration between unmapping the ioasid and releasing all HW access
will have HW see PCIE TLP errors due to the blocked access. If
userspace messes up the order it is fine to cause this. We already had
this discussion when talking about how to deal with process exit in the
simple SVA case.

> Let me try to paraphrase, you are suggesting common helper code and data
> format but still driver specific storage of the mapping, correct?

The driver just needs to hold the datastructure in its memory.

Like an xarray, the driver can have an xarray inside its struct
device, but the xarray library provides all the implementation.
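
A small userspace C sketch of that split, for illustration only: the
"library" part is a generic PASID set with its helpers, and the "driver"
part is a hypothetical struct my_mdev that simply embeds the set. All
names here are made up, and a 64-bit bitmap stands in for an xarray, so
only PASIDs 0..63 fit; real PASIDs are much wider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PASID_SET_BITS 64

/* ---- "library" part: generic implementation, owns no storage ---- */

struct pasid_set {
	uint64_t bits;
};

bool pasid_set_add(struct pasid_set *s, unsigned int pasid)
{
	assert(s);
	if (pasid >= PASID_SET_BITS)
		return false;	/* out of range for this toy bitmap */
	s->bits |= UINT64_C(1) << pasid;
	return true;
}

void pasid_set_del(struct pasid_set *s, unsigned int pasid)
{
	assert(s);
	if (pasid < PASID_SET_BITS)
		s->bits &= ~(UINT64_C(1) << pasid);
}

bool pasid_set_test(const struct pasid_set *s, unsigned int pasid)
{
	assert(s);
	return pasid < PASID_SET_BITS && ((s->bits >> pasid) & 1);
}

/* ---- "driver" part: the data lives in the driver's own struct ---- */

struct my_mdev {
	int id;
	struct pasid_set allowed;	/* per-mdev list of allowed PASIDs */
};

/* Check the driver would make before programming a PASID into HW. */
bool my_mdev_submit_ok(const struct my_mdev *m, unsigned int pasid)
{
	return pasid_set_test(&m->allowed, pasid);
}
```

Because each mdev instance carries its own set, authorizing a PASID for
one mdev says nothing about any other mdev, even in the same VM.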

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 17:31                               ` Jason Gunthorpe
@ 2021-03-31 18:20                                 ` Jacob Pan
  2021-03-31 18:33                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-31 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> > > We should try to avoid hidden behind the scenes kernel
> > > interconnections between subsystems.
> > >   
> > Can we, in the exception cases? Since all these IOCTLs come from
> > unreliable user space, we must deal with all exceptions.
> >
> > For example, when the user closes the /dev/ioasid FD before (or w/o) the
> > unbind IOCTL for VFIO and KVM, the kernel must do cleanup and coordinate
> > among subsystems. In this patchset, we have a per-mm (ioasid_set)
> > notifier to inform mdev and KVM to clean up and drop their refcounts. Do
> > you have any suggestion on this?
> 
> The ioasid should be a reference counted object.
> 
yes, this is done in this patchset.

> When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> and parks the PASID until *all* places release it. Upon a zero
> refcount the PASID is recycled for future use.
> 
Just to clarify, you are saying (when FREE happens before proper
teardown) there is no need to proactively notify all users of the IOASID to
drop their reference. Instead, just wait for the other parties to naturally
close and drop their references. Am I understanding you correctly?

I feel having the notifications can add two values:
1. Shorten the duration of errors (as you mentioned below), FD close can
take a long and unpredictable time. e.g. FD shared.
2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

> The duration between unmapping the ioasid and releasing all HW access
> will have HW see PCIE TLP errors due to the blocked access. If
> userspace messes up the order it is fine to cause this. We already had
> this discussion when talking about how to deal with process exit in the
> simple SVA case.
Yes, we have disabled fault reporting during this period. The slight
difference vs. the simple SVA case is that KVM is also involved and there
might be an ordering requirement to stop the vCPU first.

Thanks,

Jacob


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 18:20                                 ` Jacob Pan
@ 2021-03-31 18:33                                   ` Jason Gunthorpe
  2021-03-31 21:50                                     ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 18:33 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang

On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > > > We should try to avoid hidden behind the scenes kernel
> > > > interconnections between subsystems.
> > > >   
> > > Can we, in the exception cases? Since all these IOCTLs come from
> > > unreliable user space, we must deal with all exceptions.
> > >
> > > For example, when the user closes the /dev/ioasid FD before (or w/o)
> > > the unbind IOCTL for VFIO and KVM, the kernel must do cleanup and
> > > coordinate among subsystems. In this patchset, we have a per-mm
> > > (ioasid_set) notifier to inform mdev and KVM to clean up and drop
> > > their refcounts. Do you have any suggestion on this?
> > 
> > The ioasid should be a reference counted object.
> > 
> yes, this is done in this patchset.
> 
> > When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> > and parks the PASID until *all* places release it. Upon a zero
> > refcount the PASID is recycled for future use.
> > 
> Just to clarify, you are saying (when FREE happens before proper
> teardown) there is no need to proactively notify all users of the IOASID to
> drop their reference. Instead, just wait for the other parties to naturally
> close and drop their references. Am I understanding you correctly?

Yes. What are receivers going to do when you notify them anyhow? What
will a mdev do? This is how you get into the crazy locking problems.

It is an error for userspace to shut down like this; recover sensibly
and don't crash the kernel. PCIe error TLPs are expected, suppress
them. That is what we decided on in the mmu notifier discussion.

> I feel having the notifications can add two values:
> 1. Shorten the duration of errors (as you mentioned below), FD close can
> take a long and unpredictable time. e.g. FD shared.

Only if userspace exits in some uncontrolled way. In a controlled exit
it can close all the FDs in the right order.

It is OK if userspace does something weird and ends up with disabled
IOASIDs. It shouldn't do that if it cares.

> 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

This is a hard ask too, there is no natural ordering here I can see,
obviously we want vcpu, mdev, iommu for qemu but that doesn't seem to
fall out unless we explicitly hard wire it into the kernel.

Doesn't kvm always kill the vCPU first based on the mmu notifier
shooting down all the memory? IIRC this happens before FD close?

> > The duration between unmapping the ioasid and releasing all HW access
> > will have HW see PCIE TLP errors due to the blocked access. If
> > userspace messes up the order it is fine to cause this. We already had
> > this discussion when talking about how to deal with process exit in the
> > simple SVA case.
> Yes, we have disabled fault reporting during this period. The slight
> difference vs. the simple SVA case is that KVM is also involved and there
> might be an ordering requirement to stop the vCPU first.

KVM can continue to use the PASIDs, they are parked and DMA is
permanently blocked. When KVM reaches a natural point in its teardown
it can release them.

If you have to stop the vcpu from an iommu notifier you are in the
crazy locking world I mentioned. IMHO don't create exciting locking
problems just to avoid PCI errors in uncontrolled shutdown.

Suppress the errors instead.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 18:33                                   ` Jason Gunthorpe
@ 2021-03-31 21:50                                     ` Jacob Pan
  0 siblings, 0 replies; 269+ messages in thread
From: Jacob Pan @ 2021-03-31 21:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj Ashok, Tian, Kevin, Yi Liu,
	Wu Hao, Dave Jiang, jacob.jun.pan

Hi Jason,

On Wed, 31 Mar 2021 15:33:24 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > > > We should try to avoid hidden behind the scenes kernel
> > > > > interconnections between subsystems.
> > > > >     
>  [...]  
>  [...]  
> > yes, this is done in this patchset.
> >   
>  [...]  
> > Just to clarify, you are saying (when FREE happens before proper
> > teardown) there is no need to proactively notify all users of the
> > IOASID to drop their reference. Instead, just wait for the other
> > parties to naturally close and drop their references. Am I
> > understanding you correctly?  
> 
> Yes. What are receivers going to do when you notify them anyhow? What
> will an mdev do? This is how you get into the crazy locking problems.
> 
The receivers perform cleanup work similar to normal unbind. Drain/Abort
PASID. Locking is an issue in that the atomic notifier is under IOASID
spinlock, so I provided a common ordered workqueue to let mdev drivers
queue cleanup work that cannot be done in atomic context. Not ideal. Also
need to prevent nested notifications for certain cases.

> It is an error for userspace to shutdown like this, recover sensibly
> and don't crash the kernel. PCIe error TLPs are expected, suppress
> them. That is what we decided on the mmu notifier discussion.
> 
> > I feel having the notifications can add two values:
> > 1. Shorten the duration of errors (as you mentioned below), FD close can
> > take a long and unpredictable time. e.g. FD shared.  
> 
> Only if userspace exits in some uncontrolled way. In a controlled exit
> it can close all the FDs in the right order.
> 
> It is OK if userspace does something weird and ends up with disabled
> IOASIDs. It shouldn't do that if it cares.
> 
Agreed.

> > 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.
> >  
> 
> This is a hard ask too, there is no natural ordering here I can see,
> obviously we want vcpu, mdev, iommu for qemu but that doesn't seem to
> fall out unless we explicitly hard wire it into the kernel.
> 
The ordering problem, as I understand it, is that it is difficult for KVM to
rendezvous all vCPUs before updating the PASID translation table. So there
could be an in-flight enqcmd with the stale PASID after the PASID table
update and refcount drop.

If KVM is the last one to drop the PASID refcount, the PASID could be
immediately reused and start a new life. An in-flight enqcmd with the
stale PASID could then cause problems. The likelihood and the window are
very small.

If we ensure KVM does the PASID table update before the IOMMU and mdev
drivers, the stale PASID in the in-flight enqcmd would be drained before
the PASID starts a new life.

Perhaps Yi and Kevin can explain this better.

> Doesn't kvm always kill the vCPU first based on the mmu notifier
> shooting down all the memory? IIRC this happens before FD close?
> 
I don't know the answer, Kevin & Yi?

> > > The duration between unmapping the ioasid and releasing all HW access
> > > will have HW see PCIE TLP errors due to the blocked access. If
> > > userspace messes up the order it is fine to cause this. We already had
> > > this discussion when talking about how to deal with process exit in the
> > > simple SVA case.  
> > Yes, we have disabled fault reporting during this period. The slight
> > differences vs. the simple SVA case is that KVM is also involved and
> > there might be an ordering requirement to stop vCPU first.  
> 
> KVM can continue to use the PASIDs, they are parked and DMA is
> permanently blocked. When KVM reaches a natural point in its teardown
> it can release them.
> 
> If you have to stop the vcpu from an iommu notifier you are in the
> crazy locking world I mentioned. IMHO don't create exciting locking
> problems just to avoid PCI errors in uncontrolled shutdown.
> 
> Suppress the errors instead.
> 
I agree, this simplifies things a lot. Just need to clarify the in-flight
enqcmd case.

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 12:38                           ` Jason Gunthorpe
@ 2021-03-31 23:46                             ` Jacob Pan
  2021-04-01  0:37                               ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-03-31 23:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave, jacob.jun.pan

Hi Jason,

On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> > > Get rid of the ioasid set.
> > >
> > > Each driver has its own list of allowed ioasids.  
>  [...]  
> 
> The /dev/ioasid FD replaces this security check. By becoming FD
> centric you don't need additional kernel security objects.
> 
> Any process with access to the /dev/ioasid FD is allowed to control
> those PASID. The separation between VMs falls naturally from the
> separation of FDs without creating additional, complicated, security
> infrastructure in the kernel.
> 
> This is why all APIs must be FD focused, and you need to have a
> logical layering of responsibility.
> 
>  Allocate a /dev/ioasid FD
>  Allocate PASIDs inside the FD
>  Assign memory to the PASIDS
> 
>  Open a device FD, eg from VFIO or VDPA
>  Instruct the device FD to authorize the device to access PASID A in
>  an ioasid FD
How do we know user provided PASID A was allocated by the ioasid FD?
Shouldn't we validate user input by tracking which PASIDs are allocated by
which ioasid FD? This is one reason why we have ioasid_set and its xarray.

>    * Prior to being authorized the device will have NO access to any
>      PASID
>    * Presenting BOTH the device FD and the ioasid FD to the kernel
>      is the security check. Any process with both FDs is allowed to
>      make the connection. This is normal Unix FD centric thinking.
> 
> > > Register a ioasid in the driver's list by passing the fd and ioasid #
> > >  
> > 
> > The fd here is a device fd. Am I right?   
> 
> It would be the vfio_device FD, for instance, and a VFIO IOCTL.
> 
> > If yes, your idea is ioasid is allocated via /dev/ioasid and
> > associated with device fd via either VFIO or vDPA ioctl. right?
> > sorry I may be asking silly questions but really need to ensure we
> > are on the same page.  
> 
> Yes, this is right
> 
> > > No listening to events. A simple understandable security model.  
> > 
> > For this suggestion, I have a little bit concern if we may have A-B/B-A
> > lock sequence issue since it requires the /dev/ioasid (if it supports)
> > to call back into VFIO/VDPA to check if the ioasid has been registered
> > to device FD and record it in the per-device list. right? Let's have
> > more discussion based on the skeleton sent by Kevin.  
> 
> Callbacks would be backwards.
> 
> User calls vfio with vfio_device fd and dev/ioasid fd
> 
> VFIO extracts some kernel representation of the ioasid from the ioasid
> fd using an API
> 
This lookup API seems to be asking for a per-ioasid-FD storage array. Today,
the ioasid_set is per mm and contains an xarray. Since each VM (KVM) can only
open one ioasid FD, this per-FD array would be equivalent to the per-mm
ioasid_set, right?

> VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> IOMMU that this PCI device is allowed to use this PASID'
> 
Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
skip this step for now and wait for the user to do SVA bind.

> VFIO mdev drivers then record that the PASID is allowed in its own
> device specific struct for later checking during other system calls.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 23:46                             ` Jacob Pan
@ 2021-04-01  0:37                               ` Jason Gunthorpe
  2021-04-01 17:23                                 ` Jacob Pan
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01  0:37 UTC (permalink / raw)
  To: Jacob Pan, Joerg Roedel
  Cc: Liu, Yi L, Tian, Kevin, Jean-Philippe Brucker, LKML, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > > > Get rid of the ioasid set.
> > > >
> > > > Each driver has its own list of allowed ioasids.  
> >  [...]  
> > 
> > The /dev/ioasid FD replaces this security check. By becoming FD
> > centric you don't need additional kernel security objects.
> > 
> > Any process with access to the /dev/ioasid FD is allowed to control
> > those PASID. The separation between VMs falls naturally from the
> > separation of FDs without creating additional, complicated, security
> > infrastructure in the kernel.
> > 
> > This is why all APIs must be FD focused, and you need to have a
> > logical layering of responsibility.
> > 
> >  Allocate a /dev/ioasid FD
> >  Allocate PASIDs inside the FD
> >  Assign memory to the PASIDS
> > 
> >  Open a device FD, eg from VFIO or VDPA
> >  Instruct the device FD to authorize the device to access PASID A in
> >  an ioasid FD
> How do we know user provided PASID A was allocated by the ioasid FD?

You pass in the ioasid FD and use a 'get pasid from fdno' API to
extract the required kernel structure.

> Shouldn't we validate user input by tracking which PASIDs are
> allocated by which ioasid FD?

Yes, but it is integral to the ioasid FD, not something separated.

> > VFIO extracts some kernel representation of the ioasid from the ioasid
> > fd using an API
> > 
> This lookup API seems to be asking for per ioasid FD storage array. Today,
> the ioasid_set is per mm and contains a Xarray. 

Right, put the xarray per FD. A set per mm is fairly nonsensical, we
don't use the mm as that kind of security key.
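
A minimal userspace sketch of the per-FD tracking described above, assuming a
fixed-size table in place of the kernel xarray. The names ioasid_fd_ctx,
ioasid_fd_alloc() and ioasid_fd_owns() are hypothetical and are not taken
from the patchset; the point is only that the ownership check lives inside
the per-FD context rather than in a separate per-mm set.

```c
#include <stdbool.h>

#define MAX_PASIDS 64 /* illustrative limit, not a hardware value */

/* Hypothetical per-FD context; the array stands in for an xarray. */
struct ioasid_fd_ctx {
	bool allocated[MAX_PASIDS];
};

/* Allocate the lowest free PASID in this FD context, or -1 if full. */
int ioasid_fd_alloc(struct ioasid_fd_ctx *ctx)
{
	for (int i = 1; i < MAX_PASIDS; i++) { /* 0 is kept reserved here */
		if (!ctx->allocated[i]) {
			ctx->allocated[i] = true;
			return i;
		}
	}
	return -1;
}

/* "get pasid from fd": true only if this context allocated the PASID. */
bool ioasid_fd_owns(const struct ioasid_fd_ctx *ctx, int pasid)
{
	return pasid > 0 && pasid < MAX_PASIDS && ctx->allocated[pasid];
}
```

With two contexts (say, one per VM), a PASID allocated through one context
fails the ownership check on the other, which is the separation the FD model
relies on.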

> Since each VM, KVM can only open one ioasid FD, this per FD array
> would be equivalent to the per mm ioasid_set, right?

Why only one?  Each interaction with the other FDs should include the
PASID/FD pair. There is no restriction to just one.

> > VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> > IOMMU that this PCI device is allowed to use this PASID'
>
> Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
> the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
> skip this step for now and wait for the user to do SVA bind.

I'm not sure what you are asking.

Possibly some of the IOMMU API will need a bit of adjusting to make
things split.

The act of programming the page tables and the act of authorizing a
PCI BDF to use a PASID are distinct things with two different IOCTLs.

iommu_uapi_sva_bind_gpasid() is never called by anything, and its
uAPI is never implemented.

Joerg? Why did you merge dead uapi and dead code?

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-31 12:40                           ` Jason Gunthorpe
@ 2021-04-01  4:38                             ` Liu, Yi L
  2021-04-01  7:04                               ` Liu, Yi L
  2021-04-01 11:46                               ` Jason Gunthorpe
  0 siblings, 2 replies; 269+ messages in thread
From: Liu, Yi L @ 2021-04-01  4:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 31, 2021 8:41 PM
> 
> On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> 
> > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > the VM should be able to be shared by all assigned device for the VM.
> > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > be per-device.
> 
> It is not *per-device* it is *per-ioasid*
>
> And as /dev/ioasid is an interface for controlling multiple ioasid's
> there is no issue to also multiplex the page table manipulation for
> multiple ioasids as well.
> 
> What you should do next is sketch out in some RFC the exact ioctls
> each FD would have and show how the parts I outlined would work and
> point out any remaining gaps.
> 
> The device FD is something like the vfio_device FD from VFIO, it has
> *nothing* to do with PASID beyond having a single ioctl to authorize
> the device to use the PASID. All control of the PASID is in
> /dev/ioasid.

Good to see this reply. Your idea is much clearer to me now. If I'm
understanding you correctly, I think the skeleton is something like below:

1) userspace opens /dev/ioasid; at that point an ioasid is allocated along
   with a per-ioasid context that can be used to bind page tables and
   invalidate caches, and an ioasid FD is returned to userspace.
2) userspace passes the ioasid FD to VFIO, to associate it with a device
   FD (like vfio_device FD).
3) userspace binds page table on the ioasid FD with the page table info.
4) userspace unbinds the page table on the ioasid FD
5) userspace de-associates the ioasid FD and device FD

Does the above suit your outline?

If yes, I still have the concerns below and would like your opinion.
- the ioasid FD and device association will happen at runtime instead of
  only in the setup phase.
- how about AMD and ARM's vSVA support? Their PASID allocation and page table
  management happen within the guest. They only need to bind the guest PASID
  table to the host. The above model seems unable to fit them. (Jean, Eric,
  Jacob please feel free to correct me)
- these per-ioasid SVA operations are not aligned with the native SVA usage
  model. Native SVA bind is per-device.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01  4:38                             ` Liu, Yi L
@ 2021-04-01  7:04                               ` Liu, Yi L
  2021-04-01 11:54                                 ` Jason Gunthorpe
  2021-04-01 12:05                                 ` Jean-Philippe Brucker
  2021-04-01 11:46                               ` Jason Gunthorpe
  1 sibling, 2 replies; 269+ messages in thread
From: Liu, Yi L @ 2021-04-01  7:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

Hi Jason,

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, April 1, 2021 12:39 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, March 31, 2021 8:41 PM
> >
> > On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> >
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > > be per-device.
> >
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> >
> > What you should do next is sketch out in some RFC the exact ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> >
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly. I think the skeleton is something like below:
> 
> 1) userspace opens a /dev/ioasid, meanwhile there will be an ioasid
>    allocated and a per-ioasid context which can be used to do bind page
>    table and cache invalidate, an ioasid FD returned to userspace.
> 2) userspace passes the ioasid FD to VFIO, let it associated with a device
>    FD (like vfio_device FD).
> 3) userspace binds page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD
> 5) userspace de-associates the ioasid FD and device FD
> 
> Does above suit your outline?
> 
> If yes, I still have below concern and wish to see your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just happen in the setup phase.
> - how about AMD and ARM's vSVA support? Their PASID allocation and page table
>   happens within guest. They only need to bind the guest PASID table to host.
>   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
>   to correct me)
> - this per-ioasid SVA operations is not aligned with the native SVA usage
>   model. Native SVA bind is per-device.

After reading your reply in https://lore.kernel.org/linux-iommu/20210331123801.GD1463678@nvidia.com/#t
I see you mean the /dev/ioasid FD is per-VM instead of per-ioasid, so the above
skeleton doesn't suit your idea. I drafted the skeleton below to check that we
are on the same page. But I still believe there is an open question on how to
fit ARM and AMD's vSVA support into this per-ioasid SVA operation model.
Thoughts?

+-----------------------------+-----------------------------------------------+
|      userspace              |               kernel space                    |
+-----------------------------+-----------------------------------------------+
| ioasid_fd =                 | /dev/ioasid does below:                       |
| open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
|                             |       struct list_head ioasid_list;           |
|                             |       ...                                     |
|                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
+-----------------------------+-----------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
|       ALLOC, &ioasid);      |   struct ioasid_data {                        |
|                             |       ioasid_t ioasid;                        |
|                             |       struct list_head device_list;           |
|                             |       struct list_head next;                  |
|                             |       ...                                     |
|                             |   } id_data; // id_data is per ioasid         |
|                             |                                               |
|                             |   list_add(&id_data.next,                     |
|                             |            &ifd_ctx.ioasid_list);             |
+-----------------------------+-----------------------------------------------+
| ioctl(device_fd,            | VFIO does below:                              |
|       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
|       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
|       ioasid);              | 3) register device/domain info to /dev/ioasid |
|                             |    tracked in id_data.device_list             |
|                             | 4) record the ioasid in VFIO's per-device     |
|                             |    ioasid list for future security check      |
+-----------------------------+-----------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
|       BIND_PGTBL,           | 1) find ioasid's id_data                      |
|       pgtbl_data,           | 2) loop the id_data.device_list and tell iommu|
|       ioasid);              |    give ioasid access to the devices          |
+-----------------------------+-----------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
|       UNBIND_PGTBL,         | 1) find ioasid's id_data                      |
|       ioasid);              | 2) loop the id_data.device_list and tell iommu|
|                             |    clear ioasid access to the devices         |
+-----------------------------+-----------------------------------------------+
| ioctl(device_fd,            | VFIO does below:                              |
|      DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's    |
|       ioasid_fd,            |    device ioasid list.                        |
|       ioasid);              | 2) unregister device/domain info from         |
|                             |    /dev/ioasid, clear in id_data.device_list  |
+-----------------------------+-----------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
|       FREE, ioasid);        |  list_del(&id_data.next);                     |
+-----------------------------+-----------------------------------------------+

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01  4:38                             ` Liu, Yi L
  2021-04-01  7:04                               ` Liu, Yi L
@ 2021-04-01 11:46                               ` Jason Gunthorpe
  2021-04-01 13:10                                 ` Liu, Yi L
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 11:46 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 04:38:44AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, March 31, 2021 8:41 PM
> > 
> > On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> > 
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > > be per-device.
> > 
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> > 
> > What you should do next is sketch out in some RFC the exact ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> > 
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly. I think the skeleton is something like below:
> 
> 1) userspace opens a /dev/ioasid, meanwhile there will be an ioasid
>    allocated and a per-ioasid context which can be used to do bind page
>    table and cache invalidate, an ioasid FD returned to userspace.
> 2) userspace passes the ioasid FD to VFIO, let it associated with a device
>    FD (like vfio_device FD).
> 3) userspace binds page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD
> 5) userspace de-associates the ioasid FD and device FD
> 
> Does above suit your outline?

Seems so

> If yes, I still have below concern and wish to see your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just happen in the setup phase.

Of course, this is required for security. The vIOMMU must perform the
device association when the guest requires it. Otherwise a guest
cannot isolate a PASID to a single process/device pair.

I'm worried Intel views ENQCMD as the only use of PASID in a guest,
but that is not consistent with the industry. We need to see
normal nested PASID support with assigned PCI VFs.

> - how about AMD and ARM's vSVA support? Their PASID allocation and page table
>   happens within guest. They only need to bind the guest PASID table to host.
>   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
>   to correct me)

No, everything needs the device association step or it is not
secure. 

You can give a PASID to a guest and allow it to manipulate its memory
map directly, nested under the guest's CPU page tables.

However the guest cannot authorize a PCI BDF to utilize that PASID
without going through some kind of step in the hypervisor. A Guest
should not be able to authorize a PASID for a BDF it doesn't have
access to - only the hypervisor can enforce this.
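
A minimal sketch of the device association step, assuming each device keeps
its own small table of authorized PASIDs that is only updated on behalf of
the hypervisor (for example via an ioctl on the device FD). The struct and
function names here are hypothetical, and the fixed-size array is just an
illustration of per-device state.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEV_MAX_PASIDS 8 /* illustrative capacity */

/* Hypothetical per-device state, owned by the device driver. */
struct dev_pasid_auth {
	uint32_t allowed[DEV_MAX_PASIDS];
	int count;
};

/* Check made before the device is allowed to use a PASID. */
bool dev_pasid_is_allowed(const struct dev_pasid_auth *d, uint32_t pasid)
{
	for (int i = 0; i < d->count; i++)
		if (d->allowed[i] == pasid)
			return true;
	return false;
}

/* Authorize a PASID for this device; false if the table is full. */
bool dev_allow_pasid(struct dev_pasid_auth *d, uint32_t pasid)
{
	if (dev_pasid_is_allowed(d, pasid))
		return true; /* already authorized, nothing to do */
	if (d->count == DEV_MAX_PASIDS)
		return false;
	d->allowed[d->count++] = pasid;
	return true;
}

/* De-authorize: move the last entry into the freed slot. */
void dev_disallow_pasid(struct dev_pasid_auth *d, uint32_t pasid)
{
	for (int i = 0; i < d->count; i++) {
		if (d->allowed[i] == pasid) {
			d->allowed[i] = d->allowed[--d->count];
			return;
		}
	}
}
```

Because the table is per device, authorizing a PASID for one VF says nothing
about any other VF, so a guest cannot extend a PASID to a BDF it was never
given.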

This all must also fit into the mdev model where only the
device-specific mdev driver can do the device specific PASID
authorization. A hypercall is essential, or we need to stop pretending
mdev is a good idea.

I'm sure there will be some small differences, and you should clearly
explain the entire uAPI surface so that someone from AMD and ARM can
review it.

> - this per-ioasid SVA operations is not aligned with the native SVA usage
>   model. Native SVA bind is per-device.

Seems like that is an error in native SVA.

SVA is a particular mode of the PASID's memory mapping table, it has
nothing to do with a device.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01  7:04                               ` Liu, Yi L
@ 2021-04-01 11:54                                 ` Jason Gunthorpe
  2021-04-02 12:46                                   ` Liu, Yi L
  2021-04-01 12:05                                 ` Jean-Philippe Brucker
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 11:54 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 07:04:01AM +0000, Liu, Yi L wrote:

> After reading your reply in https://lore.kernel.org/linux-iommu/20210331123801.GD1463678@nvidia.com/#t
> So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
> doesn't suit your idea.

You can do it one PASID per FD or multiple PASID's per FD. Most likely
we will have high numbers of PASID's in a qemu process so I assume
that number of FDs will start to be a constraining factor, thus
multiplexing is reasonable.

It doesn't really change anything about the basic flow.

Without digging deeply into it, either seems like a reasonable choice.

> +-----------------------------+-----------------------------------------------+
> |      userspace              |               kernel space                    |
> +-----------------------------+-----------------------------------------------+
> | ioasid_fd =                 | /dev/ioasid does below:                       |
> | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
> |                             |       struct list_head ioasid_list;           |
> |                             |       ...                                     |
> |                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |

Sure, possibly an xarray not a list

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       ALLOC, &ioasid);      |   struct ioasid_data {                        |
> |                             |       ioasid_t ioasid;                        |
> |                             |       struct list_head device_list;           |
> |                             |       struct list_head next;                  |
> |                             |       ...                                     |
> |                             |   } id_data; // id_data is per ioasid         |
> |                             |                                               |
> |                             |   list_add(&id_data.next,                     |
> |                             |            &ifd_ctx.ioasid_list);             |

Yes, this should have a kref in it too

> +-----------------------------+-----------------------------------------------+
> | ioctl(device_fd,            | VFIO does below:                              |
> |       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
> |       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
> |       ioasid);              | 3) register device/domain info to /dev/ioasid |
> |                             |    tracked in id_data.device_list             |
> |                             | 4) record the ioasid in VFIO's per-device     |
> |                             |    ioasid list for future security check      |

You would provide a function that does steps 1&2; look at eventfd for
instance.

I'm not sure we need to register the device with the ioasid. device
should incr the kref on the ioasid_data at this point.

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       BIND_PGTBL,           | 1) find ioasid's id_data                      |
> |       pgtbl_data,           | 2) loop the id_data.device_list and tell iommu|
> |       ioasid);              |    give ioasid access to the devices          |

This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
give the ioasid to the device.

Here the ioctl should be about assigning a memory map from the current
mm_struct to the pasid

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       UNBIND_PGTBL,         | 1) find ioasid's id_data                      |
> |       ioasid);              | 2) loop the id_data.device_list and tell iommu|
> |                             |    clear ioasid access to the devices         |

Also seems backwards. The ioctl here should be 'destroy ioasid' which
wipes out the page table, halts DMA access and parks the PASID until
all users are done.

> +-----------------------------+-----------------------------------------------+
> | ioctl(device_fd,            | VFIO does below:                              |
> |      DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's    |
> |       ioasid_fd,            |    device ioasid list.                        |
> |       ioasid);              | 2) unregister device/domain info from         |
> |                             |    /dev/ioasid, clear in id_data.device_list  |

This should disconnect the iommu and kref_put the ioasid_data

Remember the layering, only the device_fd knows what the pci_device is
that it is touching, it doesn't make a lot of sense to leak that into
the ioasid world that should only be dealing with the page table
mapping.

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       FREE, ioasid);        |  list_del(&id_data.next);                     |
> +-----------------------------+-----------------------------------------------+

Don't know if we need a free. The sequence above is backwards: the
page table should be set up, the device authorized, the device
de-authorized, then the page table destroyed. The PASID recycles once
everyone is released.
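
To make that ordering concrete, here is a minimal userspace C sketch.
All names are hypothetical (this is not an existing kernel API); it
only models the rule that authorization requires a page table, and
that a destroyed PASID stays parked until its last user is gone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not a kernel API: one PASID with a page table
 * and a count of devices currently authorized to use it. */
struct pasid_model {
	bool pgtbl_bound;	/* page table has been set up */
	unsigned int users;	/* devices currently authorized */
	bool parked;		/* destroyed, waiting for users to finish */
	bool recycled;		/* ID may be handed out again */
};

static int pasid_bind_pgtbl(struct pasid_model *p)
{
	if (p->pgtbl_bound || p->parked)
		return -1;
	p->pgtbl_bound = true;
	return 0;
}

/* Authorization is only valid once a page table exists. */
static int pasid_authorize(struct pasid_model *p)
{
	if (!p->pgtbl_bound || p->parked)
		return -1;
	p->users++;
	return 0;
}

static int pasid_deauthorize(struct pasid_model *p)
{
	if (p->users == 0)
		return -1;
	p->users--;
	if (p->users == 0 && p->parked)
		p->recycled = true;
	return 0;
}

/* Destroy wipes the page table and parks the PASID until all users
 * are done; only then is the ID recycled. */
static int pasid_destroy(struct pasid_model *p)
{
	if (!p->pgtbl_bound)
		return -1;
	p->pgtbl_bound = false;
	p->parked = true;
	if (p->users == 0)
		p->recycled = true;
	return 0;
}
```

In this model a destroy issued while a device is still authorized does
not recycle the ID; the final de-authorize does.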

Include a sequence showing how the kvm FD is used to program the
vPASID to pPASID table that ENQCMD uses.

Show how dynamic authorization works based on requests from the
guest's vIOMMU

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01  7:04                               ` Liu, Yi L
  2021-04-01 11:54                                 ` Jason Gunthorpe
@ 2021-04-01 12:05                                 ` Jean-Philippe Brucker
  2021-04-01 12:12                                   ` Jason Gunthorpe
  2021-04-01 13:38                                   ` Liu, Yi L
  1 sibling, 2 replies; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-04-01 12:05 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jason Gunthorpe, Tian, Kevin, Jacob Pan, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 07:04:01AM +0000, Liu, Yi L wrote:
> > - how about AMD and ARM's vSVA support? Their PASID allocation and page
> > table
> >   happens within guest. They only need to bind the guest PASID table to
> > host.

In this case each VM has its own IOASID space, and the host IOASID
allocator doesn't participate. Plus this only makes sense when assigning a
whole VF to a guest, and VFIO is the tool for this. So I wouldn't shoehorn
those ops into /dev/ioasid, though we do need a transport for invalidate
commands.

> >   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
> >   to correct me)
> > - this per-ioasid SVA operations is not aligned with the native SVA usage
> >   model. Native SVA bind is per-device.

Bare-metal SVA doesn't need /dev/ioasid either. A program uses a device
handle to either ask whether SVA is enabled, or to enable it explicitly.
With or without /dev/ioasid, that step is required. OpenCL uses the first
method - automatically enable "fine-grain system SVM" if available, and
provide a flag to userspace.

So userspace does not need to know about PASID. It's only one method for
doing SVA (some GPUs are context-switching page tables instead).

> After reading your reply in https://lore.kernel.org/linux-iommu/20210331123801.GD1463678@nvidia.com/#t
> So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
> doesn't suit your idea. I draft below skeleton to see if our mind is the
> same. But I still believe there is an open on how to fit ARM and AMD's
> vSVA support in this the per-ioasid SVA operation model. thoughts?
> 
> +-----------------------------+-----------------------------------------------+
> |      userspace              |               kernel space                    |
> +-----------------------------+-----------------------------------------------+
> | ioasid_fd =                 | /dev/ioasid does below:                       |
> | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
> |                             |       struct list_head ioasid_list;           |
> |                             |       ...                                     |
> |                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       ALLOC, &ioasid);      |   struct ioasid_data {                        |
> |                             |       ioasid_t ioasid;                        |
> |                             |       struct list_head device_list;           |
> |                             |       struct list_head next;                  |
> |                             |       ...                                     |
> |                             |   } id_data; // id_data is per ioasid         |
> |                             |                                               |
> |                             |   list_add(&id_data.next,                     |
> |                             |            &ifd_ctx.ioasid_list);             |
> +-----------------------------+-----------------------------------------------+
> | ioctl(device_fd,            | VFIO does below:                              |
> |       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
> |       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
> |       ioasid);              | 3) register device/domain info to /dev/ioasid |
> |                             |    tracked in id_data.device_list             |
> |                             | 4) record the ioasid in VFIO's per-device     |
> |                             |    ioasid list for future security check      |
> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       BIND_PGTBL,           | 1) find ioasid's id_data                      |
> |       pgtbl_data,           | 2) loop the id_data.device_list and tell iommu|
> |       ioasid);              |    give ioasid access to the devices          |
> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       UNBIND_PGTBL,         | 1) find ioasid's id_data                      |
> |       ioasid);              | 2) loop the id_data.device_list and tell iommu|
> |                             |    clear ioasid access to the devices         |
> +-----------------------------+-----------------------------------------------+
> | ioctl(device_fd,            | VFIO does below:                              |
> |      DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's    |
> |       ioasid_fd,            |    device ioasid list.                        |
> |       ioasid);              | 2) unregister device/domain info from         |
> |                             |    /dev/ioasid, clear in id_data.device_list  |
> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       FREE, ioasid);        |  list_del(&id_data.next);                     |
> +-----------------------------+-----------------------------------------------+


Also wondering about:

* Querying IOMMU nesting capabilities before binding page tables (which
  page table formats are supported?). We were planning to have a VFIO cap,
  but I'm guessing we need to go back to the sysfs solution?

* Invalidation, probably an ioasid_fd ioctl?

* Page faults, page response. From and to devices, and don't necessarily
  have a PASID. But needed by vdpa as well, so that's also going through
  /dev/ioasid?

Thanks,
Jean

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 12:05                                 ` Jean-Philippe Brucker
@ 2021-04-01 12:12                                   ` Jason Gunthorpe
  2021-04-01 13:38                                   ` Liu, Yi L
  1 sibling, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 12:12 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Liu, Yi L, Tian, Kevin, Jacob Pan, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 02:05:00PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Apr 01, 2021 at 07:04:01AM +0000, Liu, Yi L wrote:
> > > - how about AMD and ARM's vSVA support? Their PASID allocation and page
> > > table
> > >   happens within guest. They only need to bind the guest PASID table to
> > > host.
> 
> In this case each VM has its own IOASID space, and the host IOASID
> allocator doesn't participate. Plus this only makes sense when assigning a
> whole VF to a guest, and VFIO is the tool for this. So I wouldn't shoehorn
> those ops into /dev/ioasid, though we do need a transport for invalidate
> commands.

How does security work? Devices still have to be authorized to use the
PASID and this approach seems like it completely excludes mdev/vdpa
from ever using a PASID, and those are the most logical users.

> > >   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
> > >   to correct me)
> > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > >   model. Native SVA bind is per-device.
> 
> Bare-metal SVA doesn't need /dev/ioasid either. 

It depends what you are doing. /dev/ioasid would provide fine grained
control over the memory mapping. It is not strict SVA, but I can see
applications where using a GPU with a pre-configured optimized mapping
could be interesting.

> A program uses a device handle to either ask whether SVA is enabled,
> or to enable it explicitly.  With or without /dev/ioasid, that step
> is required. OpenCL uses the first method - automatically enable
> "fine-grain system SVM" if available, and provide a flag to
> userspace.

SVA can be done with ioasid, we can decide if it makes sense to have
shortcuts in every driver

> So userspace does not need to know about PASID. It's only one method for
> doing SVA (some GPUs are context-switching page tables instead).

Sure, there are lots of approaches. Here we are only talking about
PASID enablement. PASID has more options.
 
> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

Only real PASIDs should use this interface. All the not-PASID stuff
is on its own.

VDPA should accept a PASID from here and configure & authorize the real
HW to attach the PASID to all DMAs connected to the virtio queues.

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 11:46                               ` Jason Gunthorpe
@ 2021-04-01 13:10                                 ` Liu, Yi L
  2021-04-01 13:15                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-04-01 13:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 1, 2021 7:47 PM
[...]
> I'm worried Intel views the only use of PASID in a guest is with
> ENQCMD, but that is not consistent with the industry. We need to see
> normal nested PASID support with assigned PCI VFs.

I'm not quite following here. Intel also allows PASID usage in guest without
ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.

[...]

> I'm sure there will be some small differences, and you should clearly
> explain the entire uAPI surface so that someone from AMD and ARM can
> review it.

good suggestion, will do.

> > - this per-ioasid SVA operations is not aligned with the native SVA usage
> >   model. Native SVA bind is per-device.
> 
> Seems like that is an error in native SVA.
> 
> SVA is a particular mode of the PASID's memory mapping table, it has
> nothing to do with a device.

I think it still has a relationship with the device. This is determined by
the DMA remapping hierarchy in hardware. e.g. in Intel VT-d, DMA isolation
is enforced first at device granularity and then at PASID granularity. SVA
makes use of both PASID and device granularity isolation.

Regards,
Yi Liu

> Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:10                                 ` Liu, Yi L
@ 2021-04-01 13:15                                   ` Jason Gunthorpe
  2021-04-01 13:43                                     ` Liu, Yi L
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 13:15 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 1, 2021 7:47 PM
> [...]
> > I'm worried Intel views the only use of PASID in a guest is with
> > ENQCMD, but that is not consistent with the industry. We need to see
> > normal nested PASID support with assigned PCI VFs.
> 
> I'm not quite following here. Intel also allows PASID usage in guest without
> ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.

Then you need all the parts, the hypervisor calls from the vIOMMU, and
you can't really use a vPASID.

I'm not sure how Intel intends to resolve all of this.

> > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > >   model. Native SVA bind is per-device.
> > 
> > Seems like that is an error in native SVA.
> > 
> > SVA is a particular mode of the PASID's memory mapping table, it has
> > nothing to do with a device.
> 
> I think it still has relationship with device. This is determined by the
> DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation is
> enforced first in device granularity and then PASID granularity. SVA makes
> usage of both PASID and device granularity isolation.

When the device driver authorizes a PASID the VT-d stuff should set up
the isolation parameters for the given pci_device and PASID.

Do not leak implementation details like this as uAPI. Authorization
and memory map are distinct ideas with distinct interfaces. Do not mix
them.

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 12:05                                 ` Jean-Philippe Brucker
  2021-04-01 12:12                                   ` Jason Gunthorpe
@ 2021-04-01 13:38                                   ` Liu, Yi L
  2021-04-01 13:42                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-04-01 13:38 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jason Gunthorpe, Tian, Kevin, Jacob Pan, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Thursday, April 1, 2021 8:05 PM
[...]
> 
> Also wondering about:
> 
> * Querying IOMMU nesting capabilities before binding page tables (which
>   page table formats are supported?). We were planning to have a VFIO cap,
>   but I'm guessing we need to go back to the sysfs solution?

I think it can also be with /dev/ioasid.

> 
> * Invalidation, probably an ioasid_fd ioctl?

yeah, if we are doing bind/unbind_pgtable via ioasid_fd, then yes,
invalidation should go this way as well. This is why I worried it may
fail to meet the requirement from you and Eric.

> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

page faults should still be per-device, but the fault event fd may be stored
in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.

Regards,
Yi Liu

> 
> Thanks,
> Jean

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:38                                   ` Liu, Yi L
@ 2021-04-01 13:42                                     ` Jason Gunthorpe
  2021-04-01 14:08                                       ` Liu, Yi L
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 13:42 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 01:38:46PM +0000, Liu, Yi L wrote:
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Sent: Thursday, April 1, 2021 8:05 PM
> [...]
> > 
> > Also wondering about:
> > 
> > * Querying IOMMU nesting capabilities before binding page tables (which
> >   page table formats are supported?). We were planning to have a VFIO cap,
> >   but I'm guessing we need to go back to the sysfs solution?
> 
> I think it can also be with /dev/ioasid.

Sure, anything to do with page table formats and setting page tables
should go through ioasid.

> > * Invalidation, probably an ioasid_fd ioctl?
> 
> yeah, if we are doing bind/unbind_pgtable via ioasid_fd, then yes,
> invalidation should go this way as well. This is why I worried it may
> fail to meet the requirement from you and Eric.

Yes, all manipulation of page tables, including removing memory ranges, or
setting memory ranges to trigger a page fault behavior should go
through here.

> > * Page faults, page response. From and to devices, and don't necessarily
> >   have a PASID. But needed by vdpa as well, so that's also going through
> >   /dev/ioasid?
> 
> page faults should still be per-device, but the fault event fd may be stored
> in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.

Here you mean non-SVA page faults that are delegated to userspace to handle?

Why would that be per-device?

Can you show the flow you imagine?

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:15                                   ` Jason Gunthorpe
@ 2021-04-01 13:43                                     ` Liu, Yi L
  2021-04-01 13:46                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-04-01 13:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 1, 2021 9:16 PM
> 
> On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 1, 2021 7:47 PM
> > [...]
> > > I'm worried Intel views the only use of PASID in a guest is with
> > > ENQCMD, but that is not consistent with the industry. We need to see
> > > normal nested PASID support with assigned PCI VFs.
> >
> > I'm not quite following here. Intel also allows PASID usage in guest
> > without ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > ENQCMD.
> 
> Then you need all the parts, the hypervisor calls from the vIOMMU, and
> you can't really use a vPASID.

This diagram shows the vSVA setup.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l.liu@intel.com/
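
As a rough illustration of the nested translation in the diagram, here
is a toy C sketch. It is only a model under simplifying assumptions:
single-level tables, 4K pages, hypothetical names, and it ignores that
real hardware also translates the FL table walk itself through the SL.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_MASK	((UINT64_C(1) << PAGE_SHIFT) - 1)
#define NPAGES		4
#define XLATE_FAULT	UINT64_MAX

/* Toy single-level table: index is the input page number, value is
 * the output page number. */
struct toy_table {
	uint64_t out_pfn[NPAGES];
	uint8_t present[NPAGES];
};

static uint64_t toy_walk(const struct toy_table *t, uint64_t addr)
{
	uint64_t pfn = addr >> PAGE_SHIFT;

	if (pfn >= NPAGES || !t->present[pfn])
		return XLATE_FAULT;
	return (t->out_pfn[pfn] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/* Nested translation: FL (guest-owned) maps GVA to GPA, then SL
 * (host-owned) maps GPA to HPA. A miss at either level faults. */
static uint64_t nested_xlate(const struct toy_table *fl,
			     const struct toy_table *sl, uint64_t gva)
{
	uint64_t gpa = toy_walk(fl, gva);

	if (gpa == XLATE_FAULT)
		return XLATE_FAULT;
	return toy_walk(sl, gpa);
}
```

The point of the split is that the guest only ever edits the FL table,
while the host keeps control of the SL table.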

> 
> I'm not sure how Intel intends to resolve all of this.
> 
> > > > - this per-ioasid SVA operations is not aligned with the native SVA
> usage
> > > >   model. Native SVA bind is per-device.
> > >
> > > Seems like that is an error in native SVA.
> > >
> > > SVA is a particular mode of the PASID's memory mapping table, it has
> > > nothing to do with a device.
> >
> > I think it still has relationship with device. This is determined by the
> > DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation
> is
> > enforced first in device granularity and then PASID granularity. SVA makes
> > usage of both PASID and device granularity isolation.
> 
> When the device driver authorizes a PASID the VT-d stuff should set up
> the isolation parameters for the given pci_device and PASID.

yes, both device and PASID are needed to set up the VT-d stuff.

> Do not leak implementation details like this as uAPI. Authorization
> and memory map are distinct ideas with distinct interfaces. Do not mix
> them.

got you. Let's focus on the uAPI things here and leave implementation details
in RFC patches.

Thanks,
Yi Liu

> Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:43                                     ` Liu, Yi L
@ 2021-04-01 13:46                                       ` Jason Gunthorpe
  2021-04-02  7:58                                         ` Tian, Kevin
  2021-04-02 10:01                                         ` Tian, Kevin
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 13:46 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 1, 2021 9:16 PM
> > 
> > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > [...]
> > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > normal nested PASID support with assigned PCI VFs.
> > >
> > > I'm not quite following here. Intel also allows PASID usage in guest
> > > without ENQCMD. e.g. Passthru a PF to guest, and use PASID on it
> > > without ENQCMD.
> > 
> > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > you can't really use a vPASID.
> 
> This is a diagram shows the vSVA setup.

I'm not talking only about vSVA. Generic PASID support with arbitrary
mappings.

And how do you deal with the vPASID vs pPASID issue if the system has
a mix of physical devices and mdevs?

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:42                                     ` Jason Gunthorpe
@ 2021-04-01 14:08                                       ` Liu, Yi L
  2021-04-01 16:03                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-04-01 14:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 1, 2021 9:43 PM
> 
> On Thu, Apr 01, 2021 at 01:38:46PM +0000, Liu, Yi L wrote:
> > > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Sent: Thursday, April 1, 2021 8:05 PM
> > [...]
> > >
> > > Also wondering about:
> > >
> > > * Querying IOMMU nesting capabilities before binding page tables
> (which
> > >   page table formats are supported?). We were planning to have a VFIO
> cap,
> > >   but I'm guessing we need to go back to the sysfs solution?
> >
> > I think it can also be with /dev/ioasid.
> 
> Sure, anything to do with page table formats and setting page tables
> should go through ioasid.
> 
> > > * Invalidation, probably an ioasid_fd ioctl?
> >
> > yeah, if we are doing bind/unbind_pgtable via ioasid_fd, then yes,
> > invalidation should go this way as well. This is why I worried it may
> > fail to meet the requirement from you and Eric.
> 
> Yes, all manipulation of page tables, including removing memory ranges, or
> setting memory ranges to trigger a page fault behavior should go
> through here.
> 
> > > * Page faults, page response. From and to devices, and don't necessarily
> > >   have a PASID. But needed by vdpa as well, so that's also going through
> > >   /dev/ioasid?
> >
> > page faults should still be per-device, but the fault event fd may be stored
> > in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.
> 
> Here you mean non-SVA page faults that are delegated to userspace to
> handle?

no, just SVA page faults. Otherwise there is no need to let userspace
handle them.

> 
> Why would that be per-device?
>
> Can you show the flow you imagine?

DMA page faults are delivered to the root complex via page request
messages, which are per-device according to the PCIe spec. The page request
handling flow is:

1) iommu driver receives a page request from the device
2) iommu driver parses the page request message to get the RID, PASID,
   faulting page, requested permissions, etc.
3) iommu driver calls iommu_report_device_fault() to trigger the fault
   handler registered by the device driver
4) device driver's fault handler signals an event FD to notify userspace to
   fetch the information about the page fault. In the VM case, the page
   fault is injected into the VM and the guest resolves it.
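
The four steps above can be sketched in plain userspace C as follows.
This is only a model: the struct layout is not the PCIe page request
wire format, and register_fault_handler(), report_device_fault() and
notify_userspace() are hypothetical stand-ins, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified page request (not the PCIe wire format). */
struct page_req {
	uint16_t rid;	/* requester ID of the faulting device */
	uint32_t pasid;
	uint64_t addr;	/* faulting address */
	uint32_t perm;	/* requested permissions */
};

typedef int (*fault_handler_t)(const struct page_req *req, void *data);

#define MAX_DEVS 4

struct dev_entry {
	uint16_t rid;
	fault_handler_t handler;	/* registered by the device driver */
	void *data;
};

static struct dev_entry devs[MAX_DEVS];
static size_t ndevs;

static int register_fault_handler(uint16_t rid, fault_handler_t h, void *data)
{
	if (ndevs == MAX_DEVS)
		return -1;
	devs[ndevs++] = (struct dev_entry){ .rid = rid, .handler = h,
					    .data = data };
	return 0;
}

/* Steps 1-3: the request has been received and parsed; route it to
 * the handler registered for the device that owns this RID. */
static int report_device_fault(const struct page_req *req)
{
	for (size_t i = 0; i < ndevs; i++)
		if (devs[i].rid == req->rid)
			return devs[i].handler(req, devs[i].data);
	return -1;	/* no handler registered for this device */
}

/* Step 4: stand-in for signalling an event FD; it only counts the
 * faults that userspace still has to fetch. */
static int notify_userspace(const struct page_req *req, void *data)
{
	(void)req;
	(*(unsigned int *)data)++;
	return 0;
}
```

Note that the lookup key here is the RID, which is what makes the
reporting per-device in this flow.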

Eric has sent the series below for page fault reporting for a VM with a
passthru device.
https://lore.kernel.org/kvm/20210223210625.604517-5-eric.auger@redhat.com/

Regards,
Yi Liu

> Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 14:08                                       ` Liu, Yi L
@ 2021-04-01 16:03                                         ` Jason Gunthorpe
  2021-04-02  7:30                                           ` Tian, Kevin
  2021-04-15 13:11                                           ` Auger Eric
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 16:03 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:

> DMA page faults are delivered to root-complex via page request message and
> it is per-device according to PCIe spec. Page request handling flow is:
> 
> 1) iommu driver receives a page request from device
> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
>    page and requested permissions etc.
> 3) iommu driver triggers fault handler registered by device driver with
>    iommu_report_device_fault()

This seems confused.

The PASID should define how to handle the page fault, not the driver.

I don't remember any device specific actions in ATS, so what is the
driver supposed to do?

> 4) device driver's fault handler signals an event FD to notify userspace to
>    fetch the information about the page fault. If it's VM case, inject the
>    page fault to VM and let guest to solve it.

If the PASID is set to 'report page fault to userspace' then some
event should come out of /dev/ioasid, or be reported to a linked
eventfd, or whatever.

If the PASID is set to 'SVM' then the fault should be passed to
handle_mm_fault

And so on.

Userspace chooses what happens based on how they configure the PASID
through /dev/ioasid.
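
A minimal C sketch of that idea, where the fault policy is a property
of the PASID rather than of a device driver. The enum and function
names are hypothetical, and the actions are only labels for what a
real implementation would do (queue an event, call handle_mm_fault,
and so on).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-PASID fault policy, chosen by userspace when it
 * configures the PASID. */
enum pasid_fault_mode {
	FAULT_MODE_NONE,	/* faults are not expected: reject */
	FAULT_MODE_USERSPACE,	/* report the fault to userspace */
	FAULT_MODE_SVM,		/* resolve against the bound mm */
};

enum fault_action {
	ACTION_REJECT,
	ACTION_QUEUE_EVENT,
	ACTION_HANDLE_MM_FAULT,
};

struct pasid_cfg {
	uint32_t pasid;
	enum pasid_fault_mode mode;
};

/* The decision depends only on how the PASID was configured; no
 * device-specific code is consulted. */
static enum fault_action pasid_fault_dispatch(const struct pasid_cfg *cfg)
{
	switch (cfg->mode) {
	case FAULT_MODE_USERSPACE:
		return ACTION_QUEUE_EVENT;
	case FAULT_MODE_SVM:
		return ACTION_HANDLE_MM_FAULT;
	case FAULT_MODE_NONE:
	default:
		return ACTION_REJECT;
	}
}
```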

Why would a device driver get involved here?

> Eric has sent below series for the page fault reporting for VM with passthru
> device.
> https://lore.kernel.org/kvm/20210223210625.604517-5-eric.auger@redhat.com/

It certainly should not be in vfio pci. Everything using a PASID needs
this infrastructure, VDPA, mdev, PCI, CXL, etc.

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01  0:37                               ` Jason Gunthorpe
@ 2021-04-01 17:23                                 ` Jacob Pan
  2021-04-01 17:26                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-04-01 17:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Liu, Yi L, Tian, Kevin, Jean-Philippe Brucker,
	LKML, Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo,
	Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave, jacob.jun.pan

Hi Jason,

On Wed, 31 Mar 2021 21:37:05 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > > > Get rid of the ioasid set.
> > > > >
> > > > > Each driver has its own list of allowed ioasids.    
> > >  [...]  
> > > 
> > > The /dev/ioasid FD replaces this security check. By becoming FD
> > > centric you don't need additional kernel security objects.
> > > 
> > > Any process with access to the /dev/ioasid FD is allowed to control
> > > those PASIDs. The separation between VMs falls naturally from the
> > > separation of FDs without creating additional, complicated, security
> > > infrastructure in the kernel.
> > > 
> > > This is why all APIs must be FD focused, and you need to have a
> > > logical layering of responsibility.
> > > 
> > >  Allocate a /dev/ioasid FD
> > >  Allocate PASIDs inside the FD
Just to be super clear. Do we allocate a FD for each PASID and return the
FD to the user? Or return the plain PASID number back to the user space?

> > >  Assign memory to the PASIDS
> > > 
> > >  Open a device FD, eg from VFIO or VDP
> > >  Instruct the device FD to authorize the device to access PASID A in
> > >  an ioasid FD  
> > How do we know user provided PASID A was allocated by the ioasid FD?  
> 
> You pass in the ioasid FD and use a 'get pasid from fdno' API to
> extract the required kernel structure.
> 
Seems you are talking about two FDs:
- /dev/ioasid FD
- per IOASID FD
This API ioasid = get_pasid_from_fd(dev_ioasid_fd, ioasid_fd);
dev_ioasid_fd will find the xarray for all the PASIDs allocated under it,
ioasid_fd will be the index into the xarray to retrieve the actual ioasid.
Correct?

> > Shouldn't we validate user input by tracking which PASIDs are
> > allocated by which ioasid FD?  
> 
> Yes, but it is integral to the ioasid FD, not something separated.
> 
OK, if we have per IOASID FD in addition to the /dev/ioasid FD we can
validate user input.

> > > VFIO extracts some kernel representation of the ioasid from the ioasid
> > > fd using an API
> > >   
> > This lookup API seems to be asking for per ioasid FD storage array.
> > Today, the ioasid_set is per mm and contains a Xarray.   
> 
> Right, put the xarray per FD. A set per mm is fairly nonsensical, we
> don't use the mm as that kind of security key.
> 
Sounds good, one xarray per /dev/ioasid FD.

> > Since each VM, KVM can only open one ioasid FD, this per FD array
> > would be equivalent to the per mm ioasid_set, right?  
> 
> Why only one?  Each interaction with the other FDs should include the
> PASID/FD pair. There is no restriction to just one.
> 
OK, one per subsystem-VM. For example, if a VM has a VFIO and a VDPA
device, it should have only two /dev/ioasid FDs, one for each. Correct?

> > > VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> > > IOMMU that this PCI device is allowed to use this PASID'  
> >
> > Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I
> > thought the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls.
> > Or we can skip this step for now and wait for the user to do SVA bind.  
> 
> I'm not sure what you are asking.
> 
> Possibly some of the IOMMU API will need a bit adjusting to make
> things split.
> 
> The act of programming the page tables and the act of authorizing a
> PCI BDF to use a PASID are distinct things with two different IOCTLs.
> 
Why separate? I don't see a use case for authorizing a PASID without
binding it to a page table. The very act of binding the page table *is* the
authorization.

> iommu_uapi_sva_bind_gpasid() is never called by anything, and it's
> uAPI is never implemented.
> 
Just a little background here. We have been working on the vSVA stack
since 2017. At the time, VFIO was the de facto interface for IOMMU-aware
driver frameworks. These uAPIs were always developed alongside the
accompanying VFIO patches, which served as consumers. By the time these
IOMMU uAPIs were merged after reviews from most vendors, the VFIO patchset
was approaching maturity at around v7. This is when we suddenly saw a new
request to support VDPA, which had attempted to use VFIO earlier but
ultimately moved away.

For a complex stack like vSVA, I feel we have to reduce moving parts and do
some divide and conquer.

> Joerg? Why did you merge dead uapi and dead code?
> 
> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 17:23                                 ` Jacob Pan
@ 2021-04-01 17:26                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 17:26 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Joerg Roedel, Liu, Yi L, Tian, Kevin, Jean-Philippe Brucker,
	LKML, Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo,
	Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

On Thu, Apr 01, 2021 at 10:23:55AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 21:37:05 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> > > Hi Jason,
> > > 
> > > On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > > wrote: 
> > > > > > Get rid of the ioasid set.
> > > > > >
> > > > > > Each driver has its own list of allowed ioasids.    
> > > >  [...]  
> > > > 
> > > > The /dev/ioasid FD replaces this security check. By becoming FD
> > > > centric you don't need additional kernel security objects.
> > > > 
> > > > Any process with access to the /dev/ioasid FD is allowed to control
> > > > those PASID. The separation between VMs falls naturally from the
> > > > separation of FDs without creating additional, complicated, security
> > > > infrastructure in the kernel.
> > > > 
> > > > This is why all APIs must be FD focused, and you need to have a
> > > > logical layering of responsibility.
> > > > 
> > > >  Allocate a /dev/ioasid FD
> > > >  Allocate PASIDs inside the FD
> Just to be super clear. Do we allocate a FD for each PASID and return the
> FD to the user? Or return the plain PASID number back to the user space?

I would do multiple PASIDs per /dev/ioasid FD because we expect a lot
of PASIDs to be in use and we'd run into FDno limits.

> > > >  Assign memory to the PASIDS
> > > > 
> > > >  Open a device FD, eg from VFIO or VDP
> > > >  Instruct the device FD to authorize the device to access PASID A in
> > > >  an ioasid FD  
> > > How do we know user provided PASID A was allocated by the ioasid FD?  
> > 
> > You pass in the ioasid FD and use a 'get pasid from fdno' API to
> > extract the required kernel structure.
> > 
> Seems you are talking about two FDs:
> - /dev/ioasid FD

No, just this one.

> - per IOASID FD
> This API ioasid = get_pasid_from_fd(dev_ioasid_fd, ioasid_fd);
> dev_ioasid_fd will find the xarray for all the PASIDs allocated under it,
> ioasid_fd will be the index into the xarray to retrieve the actual ioasid.
> Correct?

'ioasid_fd' is just the ioasid number in whatever numberspace the
/dev/ioasid FD's use.
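
To make the lookup concrete, here is a minimal userspace model in C of a
per-FD number space. Every name in it (ioasid_fd_ctx, ioasid_fd_alloc,
ioasid_fd_lookup, IOASID_MAX_PER_FD) is hypothetical and chosen only for
illustration, and the fixed-size table stands in for whatever the kernel
would really use (most likely an xarray, as discussed in this thread). It
only shows that an ID resolves through the context that allocated it and
through no other.

```c
#include <assert.h>
#include <stddef.h>

#define IOASID_MAX_PER_FD 64	/* hypothetical per-FD limit */

/* Hypothetical per-ioasid record. */
struct ioasid_data {
	unsigned int ioasid;
	int in_use;
};

/* Hypothetical context attached to one open /dev/ioasid FD. */
struct ioasid_fd_ctx {
	struct ioasid_data slots[IOASID_MAX_PER_FD];
};

/* Allocate the lowest free number in this FD's number space, or -1. */
static int ioasid_fd_alloc(struct ioasid_fd_ctx *ctx)
{
	assert(ctx != NULL);
	for (unsigned int i = 0; i < IOASID_MAX_PER_FD; i++) {
		if (!ctx->slots[i].in_use) {
			ctx->slots[i].in_use = 1;
			ctx->slots[i].ioasid = i;
			return (int)i;
		}
	}
	return -1;
}

/* "get pasid from fdno": only IDs allocated through this ctx resolve. */
static struct ioasid_data *ioasid_fd_lookup(struct ioasid_fd_ctx *ctx,
					    unsigned int ioasid)
{
	assert(ctx != NULL);
	if (ioasid >= IOASID_MAX_PER_FD || !ctx->slots[ioasid].in_use)
		return NULL;
	return &ctx->slots[ioasid];
}
```

With two contexts, an ID allocated in the first one returns NULL when looked
up through the second, which is the validation of user input asked about
earlier in the thread.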

> > Why only one?  Each interaction with the other FDs should include the
> > PASID/FD pair. There is no restriction to just one.

> OK, one per subsystem-VM. For example, if a VM has a VFIO and a VDPA
> device, it should have only two /dev/ioasid FDs, one for each. Correct?

No, only one.

For something like qemu's use case I mostly expect the vIOMMU driver
will open /dev/ioasid for each vIOMMU instance it creates (basically
only one)

> > The act of programming the page tables and the act of authorizing a
> > PCI BDF to use a PASID are distinct things with two different IOCTLs.
> > 
> Why separate? 

Because they have different owners and different layers in the
software.

It is not about use case, it is about putting the control points where
they naturally belong.

> For a complex stack like vSVA, I feel we have to reduce moving parts and do
> some divide and conquer.

uAPI should have all come together with a user and user application.

uAPI is the hardest and most important part.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 16:03                                         ` Jason Gunthorpe
@ 2021-04-02  7:30                                           ` Tian, Kevin
  2021-04-05 23:35                                             ` Jason Gunthorpe
  2021-04-15 13:11                                           ` Auger Eric
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-02  7:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: Jean-Philippe Brucker, Jacob Pan, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 2, 2021 12:04 AM
> 
> On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> 
> > DMA page faults are delivered to root-complex via page request message
> > and it is per-device according to PCIe spec. Page request handling flow is:
> >
> > 1) iommu driver receives a page request from device
> > 2) iommu driver parses the page request message. Get the RID,PASID,
> >    faulted page and requested permissions etc.
> > 3) iommu driver triggers fault handler registered by device driver with
> >    iommu_report_device_fault()
> 
> This seems confused.
> 
> The PASID should define how to handle the page fault, not the driver.
> 
> I don't remember any device specific actions in ATS, so what is the
> driver supposed to do?
> 
> > 4) device driver's fault handler signals an event FD to notify userspace to
> >    fetch the information about the page fault. If it's VM case, inject the
> >    page fault to VM and let guest to solve it.
> 
> If the PASID is set to 'report page fault to userspace' then some
> event should come out of /dev/ioasid, or be reported to a linked
> eventfd, or whatever.
> 
> If the PASID is set to 'SVM' then the fault should be passed to
> handle_mm_fault
> 
> And so on.
> 
> Userspace chooses what happens based on how they configure the PASID
> through /dev/ioasid.
> 
> Why would a device driver get involved here?
> 
> > Eric has sent below series for the page fault reporting for VM with passthru
> > device.
> > https://lore.kernel.org/kvm/20210223210625.604517-5-eric.auger@redhat.com/
> 
> It certainly should not be in vfio pci. Everything using a PASID needs
> this infrastructure, VDPA, mdev, PCI, CXL, etc.
> 

This touches on an interesting point:

The fault may be triggered in either the 1st-level or the 2nd-level page
table when nested translation is enabled (in the vSVA case). The 1st-level
is bound by the user space, which therefore needs to receive the fault
event. The 2nd-level is managed by VFIO (or vDPA), which needs to fix the
fault in the kernel (e.g. find the HVA for the faulting GPA, call
handle_mm_fault and map GPA->HPA in the IOMMU). Yi's current proposal lets
VFIO register the device fault handler, which then forwards the event
through /dev/ioasid to userspace only if it is a 1st-level fault. Are you
suggesting a pgtable-centric fault reporting mechanism with separate
handlers for each level, i.e. letting VFIO register a handler only for
2nd-level faults and /dev/ioasid register a handler for 1st-level faults?
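
The split being asked about can be sketched as a small routing function in
C. The enum and struct names are hypothetical, and the sketch assumes the
fault record already says which translation level faulted; it only models
the routing decision, not the actual handlers.

```c
#include <assert.h>
#include <stddef.h>

/* Which level of the nested translation raised the fault (assumed known). */
enum fault_level { FAULT_LEVEL_1, FAULT_LEVEL_2 };

/* Where the fault ends up in this model. */
enum fault_route { ROUTE_TO_USERSPACE, ROUTE_FIX_IN_KERNEL };

/* Hypothetical, heavily reduced page request record. */
struct io_page_fault {
	unsigned int pasid;
	unsigned long long addr;
	enum fault_level level;
};

/*
 * 1st-level tables are owned by the guest, so the event goes to userspace.
 * 2nd-level tables are owned by the host, so the kernel resolves the fault.
 */
static enum fault_route route_fault(const struct io_page_fault *f)
{
	assert(f != NULL);
	return f->level == FAULT_LEVEL_1 ? ROUTE_TO_USERSPACE
					 : ROUTE_FIX_IN_KERNEL;
}
```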

Thanks
Kevin


^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:46                                       ` Jason Gunthorpe
@ 2021-04-02  7:58                                         ` Tian, Kevin
  2021-04-05 23:39                                           ` Jason Gunthorpe
  2021-04-02 10:01                                         ` Tian, Kevin
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-02  7:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 1, 2021 9:47 PM
> 
> On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 1, 2021 9:16 PM
> > >
> > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > [...]
> > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > normal nested PASID support with assigned PCI VFs.
> > > >
> > > > I'm not quite following here. Intel also allows PASID usage in guest without
> > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > ENQCMD.
> > >
> > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > you can't really use a vPASID.
> >
> > This is a diagram shows the vSVA setup.
> 
> I'm not talking only about vSVA. Generic PASID support with arbitary
> mappings.
> 
> And how do you deal with the vPASID vs pPASID issue if the system has
> a mix of physical devices and mdevs?
> 

We plan to support two schemes. One is vPASID identity-mapped to
pPASID, in which case the mixed scenario just works, with the limitation
that live migration is not supported. The other is a non-identity-mapped
scheme, where live migration is supported but physical devices and
mdevs should not be mixed in one VM if both expose SVA capability
(this requires some filtering check in Qemu). Although we have some
ideas for relaxing this restriction in the non-identity scheme, it requires
more thinking given how the vSVA uAPI is being refactored.

In both cases the virtual VT-d will report a virtual capability to the guest,
indicating that the guest must request PASID through a vcmd register
instead of creating its own namespace. The vIOMMU returns a vPASID 
to the guest upon request. The vPASID could be directly mapped to a 
pPASID or allocated from a new namespace based on user configuration.

We hope the /dev/ioasid can support both schemes, with the minimal
requirement of allowing userspace to tag a vPASID to a pPASID and
allowing mdev to translate vPASID into pPASID, i.e. not assuming that
the guest will always use pPASID.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-03-30 13:28                       ` Jason Gunthorpe
  2021-03-31  7:38                         ` Liu, Yi L
@ 2021-04-02  8:22                         ` Tian, Kevin
  2021-04-05 23:42                           ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-02  8:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jason Wang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> quemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 

I agree with you on consolidating things in one place (especially for the
general SVA support). But here I was referring to a usage without
pgtable binding (possibly Jason W. can say more here), where the
userspace just wants to allocate PASIDs, program/accept PASIDs to
various workqueues (device specific), and then use the MAP/UNMAP
interface to manage address spaces associated with each PASID.
I just wanted to point out that the latter two steps are through
VFIO/VDPA specific interfaces.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 13:46                                       ` Jason Gunthorpe
  2021-04-02  7:58                                         ` Tian, Kevin
@ 2021-04-02 10:01                                         ` Tian, Kevin
  1 sibling, 0 replies; 269+ messages in thread
From: Tian, Kevin @ 2021-04-02 10:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Tian, Kevin
> Sent: Friday, April 2, 2021 3:58 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 1, 2021 9:47 PM
> >
> > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > [...]
> > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > ENQCMD, but that is not consistent with the industry. We need to
> > > > > > see normal nested PASID support with assigned PCI VFs.
> > > > >
> > > > > I'm not quite following here. Intel also allows PASID usage in guest without
> > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > ENQCMD.
> > > >
> > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > you can't really use a vPASID.
> > >
> > > This is a diagram shows the vSVA setup.
> >
> > I'm not talking only about vSVA. Generic PASID support with arbitary
> > mappings.
> >
> > And how do you deal with the vPASID vs pPASID issue if the system has
> > a mix of physical devices and mdevs?
> >
> 
> We plan to support two schemes. One is vPASID identity-mapped to
> pPASID then the mixed scenario just works, with the limitation of
> lacking of live migration support. The other is non-identity-mapped
> scheme, where live migration is supported but physical devices and
> mdevs should not be mixed in one VM if both expose SVA capability
> (requires some filtering check in Qemu). Although we have some
> idea relaxing this restriction in the non-identity scheme, it requires
> more thinking given how the vSVA uAPI is being refactored.
> 
> In both cases the virtual VT-d will report a virtual capability to the guest,
> indicating that the guest must request PASID through a vcmd register
> instead of creating its own namespace. The vIOMMU returns a vPASID
> to the guest upon request. The vPASID could be directly mapped to a
> pPASID or allocated from a new namespace based on user configuration.
> 
> We hope the /dev/ioasid can support both schemes, with the minimal
> requirement of allowing userspace to tag a vPASID to a pPASID and
> allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> the guest will always use pPASID.
> 

Per your comments in other threads I suppose this requirement should
be implemented in the VFIO_ALLOW_PASID command instead of going
through /dev/ioasid, which only needs to know the pPASID and its pgtable
management.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 11:54                                 ` Jason Gunthorpe
@ 2021-04-02 12:46                                   ` Liu, Yi L
  0 siblings, 0 replies; 269+ messages in thread
From: Liu, Yi L @ 2021-04-02 12:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Wu,
	Hao, Jiang, Dave

Hi Jason,

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 1, 2021 7:54 PM
> 
> On Thu, Apr 01, 2021 at 07:04:01AM +0000, Liu, Yi L wrote:
> 
> > After reading your reply in
> > https://lore.kernel.org/linux-iommu/20210331123801.GD1463678@nvidia.com/#t
> > So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above
> > skeleton doesn't suit your idea.
> 
> You can do it one PASID per FD or multiple PASID's per FD. Most likely
> we will have high numbers of PASID's in a qemu process so I assume
> that number of FDs will start to be a contraining factor, thus
> multiplexing is reasonable.
> 
> It doesn't really change anything about the basic flow.
> 
> digging deeply into it either seems like a reasonable choice.
> 
> > +-----------------------------+-----------------------------------------------+
> > |      userspace              |               kernel space                    |
> > +-----------------------------+-----------------------------------------------+
> > | ioasid_fd =                 | /dev/ioasid does below:                       |
> > | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
> > |                             |       struct list_head ioasid_list;           |
> > |                             |       ...                                     |
> > |                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
> 
> Sure, possibly an xarray not a list
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       ALLOC, &ioasid);      |   struct ioasid_data {                        |
> > |                             |       ioasid_t ioasid;                        |
> > |                             |       struct list_head device_list;           |
> > |                             |       struct list_head next;                  |
> > |                             |       ...                                     |
> > |                             |   } id_data; // id_data is per ioasid         |
> > |                             |                                               |
> > |                             |   list_add(&id_data.next,                     |
> > |                             |            &ifd_ctx.ioasid_list);             |
> 
> Yes, this should have a kref in it too
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(device_fd,            | VFIO does below:                              |
> > |       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
> > |       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
> > |       ioasid);              | 3) register device/domain info to /dev/ioasid |
> > |                             |    tracked in id_data.device_list             |
> > |                             | 4) record the ioasid in VFIO's per-device     |
> > |                             |    ioasid list for future security check      |
> 
> You would provide a function that does steps 1&2 look at eventfd for
> instance.
> 
> I'm not sure we need to register the device with the ioasid. device
> should incr the kref on the ioasid_data at this point.
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       BIND_PGTBL,           | 1) find ioasid's id_data                      |
> > |       pgtbl_data,           | 2) loop the id_data.device_list and tell iommu|
> > |       ioasid);              |    give ioasid access to the devices          |
> 
> This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
> give the ioasid to the device.
> 
> Here the ioctl should be about assigning a memory map from the
> current mm_struct to the pasid
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       UNBIND_PGTBL,         | 1) find ioasid's id_data                      |
> > |       ioasid);              | 2) loop the id_data.device_list and tell iommu|
> > |                             |    clear ioasid access to the devices         |
> 
> Also seems backwards. The ioctl here should be 'destroy ioasid' which
> wipes out the page table, halts DMA access and parks the PASID until
> all users are done.
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(device_fd,            | VFIO does below:                              |
> > |      DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's    |
> > |       ioasid_fd,            |    device ioasid list.                        |
> > |       ioasid);              | 2) unregister device/domain info from         |
> > |                             |    /dev/ioasid, clear in id_data.device_list  |
> 
> This should disconnect the iommu and kref_put the ioasid_data

Thanks for the comments. I updated the skeleton a little bit and accepted
your xarray and kref suggestions.

+-----------------------------+------------------------------------------------+
|      userspace              |               kernel space                     |
+-----------------------------+------------------------------------------------+
| ioasid_fd =                 | /dev/ioasid does below:                        |
| open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                       |
|                             |        struct xarray xa;                       |
|                             |       ...                                      |
|                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd       |
+-----------------------------+------------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                        |
|       ALLOC, &ioasid);      |   struct ioasid_data {                         |
|                             |       ioasid_t ioasid;                         |
|                             |       refcount_t refs;                         |
|                             |       ...                                      |
|                             |   } id_data; // id_data is per ioasid          |
|                             |                                                |
|                             |   refcount_set(&id_data->refs, 1);             |
+-----------------------------+------------------------------------------------+
| ioctl(device_fd,            | VFIO does below:                               |
|       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid  |
|       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd |
|       ioasid);              | 3) incr refcount on the ioasid                 |
|                             | 4) tell iommu to give the ioasid to the device |
|                             |    by an iommu API. iommu driver needs to      |
|                             |    store the ioasid/device info in a per       |
|                             |    ioasid allow device list                    |
|                             | 5) record the ioasid in VFIO's per-device      |
|                             |    ioasid list for future security check       |
+-----------------------------+------------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                        |
|       BIND_PGTBL,           | 1) find ioasid's id_data                       |
|       pgtbl_data,           | 2) call into iommu driver with ioasid, pgtbl   |
|       ioasid);              |    data, iommu driver setup the PASID entry[1] |
|                             |    with the ioasid and the pgtbl_data          |
+-----------------------------+------------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                        |
|       CACHE_INVLD,          | 1) find ioasid's id_data                       |
|       inv_data,             | 2) call into iommu driver with ioasid, inv     |
|       ioasid);              |    data, iommu driver invalidates cache        |
+-----------------------------+------------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                        |
|       UNBIND_PGTBL,         | 1) find ioasid's id_data                       |
|       ioasid);              | 2) call into iommu driver with ioasid, iommu   |
|                             |    driver destroy the PASID entry to block DMA |
|                             |    with this ioasid from device                |
+-----------------------------+------------------------------------------------+
| ioctl(device_fd,            | VFIO does below:                               |
|      DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's     |
|       ioasid_fd,            |    device ioasid list                          |
|       ioasid);              | 2) tell iommu driver to clear the device from  |
|                             |    its per-ioasid device allow list            |
|                             | 3) put refcount on the ioasid                  |
+-----------------------------+------------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                        |
|       FREE, ioasid);        |  list_del(&id_data.next);                      |
+-----------------------------+------------------------------------------------+

[1] A PASID entry is an entry in a per-device PASID table; this is where the
    page table pointer is stored, e.g. the guest cr3 page table pointer.
    Setting up the PASID entry in a device's PASID table means that access
    is finally granted on the IOMMU side.
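
The reference counting in the skeleton above can be modeled in a few lines
of userspace C. This is only a sketch: a plain int stands in for
refcount_t, the function names are hypothetical, and it assumes that FREE
drops the initial reference taken at ALLOC rather than destroying the
object unconditionally.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-ioasid record, reduced to what the lifecycle needs. */
struct ioasid_data {
	unsigned int ioasid;
	int refs;	/* stands in for refcount_t */
	bool released;	/* set when the last reference is dropped */
};

/* ALLOC: the /dev/ioasid FD holds the initial reference. */
static void ioasid_alloc(struct ioasid_data *d, unsigned int ioasid)
{
	d->ioasid = ioasid;
	d->refs = 1;
	d->released = false;
}

/* DEVICE_ALLOW_IOASID: each authorized device takes a reference. */
static void ioasid_get(struct ioasid_data *d)
{
	assert(d->refs > 0);
	d->refs++;
}

/*
 * DEVICE_DISALLOW_IOASID or FREE: drop one reference.
 * Returns true once the record has been released.
 */
static bool ioasid_put(struct ioasid_data *d)
{
	assert(d->refs > 0);
	if (--d->refs == 0)
		d->released = true;
	return d->released;
}
```

Under these assumptions, a FREE issued while a device is still authorized
does not release the ioasid; it stays parked until the matching
DEVICE_DISALLOW_IOASID drops the last reference.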

I kept FREE as it seems more symmetric, since there is an ALLOC
exposed to userspace. But yeah, I'm open to removing it all the same
if it's really unnecessary in your opinion.

I need your help again on an open question.
The major purpose of this series is to support vSVA for guests based on
nested translation. There is another use case which is also based
on nested translation but doesn't have an ioasid, and it still needs the
bind/unbind_pgtbl and cache_invalidation uAPIs: gIOVA support. In this
usage, the page table is a guest IOVA page table. The VMM needs to bind
this page table to the host and enable nested translation, and also needs
to do cache invalidation when the guest IOVA page table changes. It's very
similar to the page table bind of vSVA. The only difference is that there
is no ioasid in the gIOVA case; instead, the gIOVA case requires device
information. But to reuse the uAPI, gIOVA needs to fit into the /dev/ioasid
model. As of now, I think it may require user space to pass a device FD to
the BIND/UNBIND_PGTBL and CACHE_INVLD ioctls, so that the iommu driver can
bind the gIOVA page table to the correct device. I'm not sure if that looks
good. Do you have any suggestions on it?

[...]
> Include a sequence showing how the kvm FD is used to program the
> vPASID to pPASID table that ENQCMD uses.
>
> Show how dynamic authorization works based on requests from the
> guest's vIOMMU

I'd like to see if the updated skeleton suits your idea first, then
draw a more complete flow to show this.

Regards,
Yi Liu

> Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-02  7:30                                           ` Tian, Kevin
@ 2021-04-05 23:35                                             ` Jason Gunthorpe
  2021-04-06  0:37                                               ` Tian, Kevin
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-05 23:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Jean-Philippe Brucker, Jacob Pan, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Fri, Apr 02, 2021 at 07:30:23AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, April 2, 2021 12:04 AM
> > 
> > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > 
> > > DMA page faults are delivered to root-complex via page request message
> > and
> > > it is per-device according to PCIe spec. Page request handling flow is:
> > >
> > > 1) iommu driver receives a page request from device
> > > 2) iommu driver parses the page request message. Get the RID,PASID,
> > faulted
> > >    page and requested permissions etc.
> > > 3) iommu driver triggers fault handler registered by device driver with
> > >    iommu_report_device_fault()
> > 
> > This seems confused.
> > 
> > The PASID should define how to handle the page fault, not the driver.
> > 
> > I don't remember any device specific actions in ATS, so what is the
> > driver supposed to do?
> > 
> > > 4) device driver's fault handler signals an event FD to notify userspace to
> > >    fetch the information about the page fault. If it's VM case, inject the
> > >    page fault to VM and let guest to solve it.
> > 
> > If the PASID is set to 'report page fault to userspace' then some
> > event should come out of /dev/ioasid, or be reported to a linked
> > eventfd, or whatever.
> > 
> > If the PASID is set to 'SVM' then the fault should be passed to
> > handle_mm_fault
> > 
> > And so on.
> > 
> > Userspace chooses what happens based on how they configure the PASID
> > through /dev/ioasid.
> > 
> > Why would a device driver get involved here?
> > 
> > > Eric has sent below series for the page fault reporting for VM with passthru
> > > device.
> > > https://lore.kernel.org/kvm/20210223210625.604517-5-
> > eric.auger@redhat.com/
> > 
> > It certainly should not be in vfio pci. Everything using a PASID needs
> > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> > 
> 
> This touches an interesting fact:
> 
> The fault may be triggered in either 1st-level or 2nd-level page table, 
> when nested translation is enabled (in vSVA case). The 1st-level is bound 
> by the user space, which therefore needs to receive the fault event. The 
> 2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in 
> kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map 
> GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the 
> device fault handler, which then forward the event through /dev/ioasid 
> to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
> centric fault reporting mechanism to separate handlers in each level, 
> i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
> ioasid register handler for 1st-level fault?

This I'm struggling to understand. /dev/ioasid should handle all the
fault cases, why would VFIO ever get involved in a fault? What would
it even do?

If the fault needs to be fixed in the hypervisor then it is a kernel
fault and it does handle_mm_fault. This absolutely should not be in
VFIO or VDPA.

If the fault needs to be fixed in the guest, then it needs to be
delivered over /dev/ioasid in some way and injected into the
vIOMMU. VFIO and VDPA have nothing to do with the vIOMMU driver in qemu.

You need to have an interface under /dev/ioasid to create both page
table levels and part of that will be to tell the kernel what VA is
mapped and how to handle faults.
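
A minimal sketch of the idea that the PASID's configuration, rather than a
device driver callback, selects the fault path. Every name here
(pasid_fault_mode, fault_route, route_page_request) is a hypothetical
illustration and not an existing kernel or /dev/ioasid interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-PASID fault policy chosen by userspace at setup time. */
enum pasid_fault_mode {
	PASID_FAULT_REPORT_USER, /* deliver the fault to userspace, e.g. via an eventfd */
	PASID_FAULT_SVM,         /* resolve in the kernel, as handle_mm_fault would */
};

/* Where a page request ends up after routing. */
enum fault_route {
	ROUTE_USERSPACE_EVENT,
	ROUTE_KERNEL_MM,
	ROUTE_REJECT,
};

struct pasid_entry {
	uint32_t pasid;
	enum pasid_fault_mode mode;
};

/* Pick the route from the PASID's configuration alone; no driver is consulted. */
enum fault_route route_page_request(const struct pasid_entry *tbl, int n,
				    uint32_t pasid)
{
	assert(tbl != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		if (tbl[i].pasid != pasid)
			continue;
		return tbl[i].mode == PASID_FAULT_SVM ? ROUTE_KERNEL_MM
						      : ROUTE_USERSPACE_EVENT;
	}
	return ROUTE_REJECT; /* unknown PASID: nothing was configured for it */
}
```

In this toy model a page request for a PASID that was never configured is
rejected, which is only one possible policy.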

VFIO/VDPA do *nothing* more than authorize the physical device to use
the given PASID.

In the VDPA case you might need to have SW access to the PASID, but
that should be provided by a generic iommu layer interface like
'copy_to/from_pasid()' not by involving VDPA in the address mapping.

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-02  7:58                                         ` Tian, Kevin
@ 2021-04-05 23:39                                           ` Jason Gunthorpe
  2021-04-06  1:02                                             ` Tian, Kevin
       [not found]                                             ` <MWHPR11MB188628BDB37A4EE36F3D99338C769@MWHPR11MB1886.namprd11.prod.outlook.com>
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-05 23:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Fri, Apr 02, 2021 at 07:58:02AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 1, 2021 9:47 PM
> > 
> > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > [...]
> > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > > normal nested PASID support with assigned PCI VFs.
> > > > >
> > > > > > I'm not quite following here. Intel also allows PASID usage in guest without
> > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > ENQCMD.
> > > >
> > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > you can't really use a vPASID.
> > >
> > > This is a diagram shows the vSVA setup.
> > 
> > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > mappings.
> > 
> > And how do you deal with the vPASID vs pPASID issue if the system has
> > a mix of physical devices and mdevs?
> > 
> 
> We plan to support two schemes. One is vPASID identity-mapped to
> pPASID then the mixed scenario just works, with the limitation of 
> lacking of live migration support. The other is non-identity-mapped 
> scheme, where live migration is supported but physical devices and 
> mdevs should not be mixed in one VM if both expose SVA capability 
> (requires some filtering check in Qemu). 

That just becomes "block vPASID support if any device that
doesn't use ENQCMD is plugged into the guest"

Which needs a special VFIO capability of some kind so qemu knows to
block it. This really needs to all be laid out together so someone
can understand it :(

Why doesn't the siov cookbook explain this stuff??

> We hope the /dev/ioasid can support both schemes, with the minimal
> requirement of allowing userspace to tag a vPASID to a pPASID and
> allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> the guest will always use pPASID.

What I'm unclear on is whether /dev/ioasid even needs to care about
vPASID or whether vPASID is just a hidden artifact of the KVM connection to
set up the translation table and the vIOMMU driver in qemu.

Since the physical HW never sees the vPASID I'm inclined to think the
latter.

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-02  8:22                         ` Tian, Kevin
@ 2021-04-05 23:42                           ` Jason Gunthorpe
  2021-04-06  1:27                             ` Tian, Kevin
  2021-04-06  1:35                             ` Jason Wang
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-05 23:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave, Jason Wang

On Fri, Apr 02, 2021 at 08:22:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 30, 2021 9:29 PM
> > 
> > >
> > > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > > bound to specific security context (e.g. a control vq in vDPA) instead of
> > > tying to mm. In this case there is no pgtable binding initiated from user
> > > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > > to the intended security context through specific passthrough framework
> > > which manages that context.
> > 
> > This sounds like the exact opposite of what I'd like to see.
> > 
> > I do not want to see every subsystem gaining APIs to program a
> > PASID. All of that should be consolidated in *one place*.
> > 
> > I do not want to see VDPA and VFIO have two nearly identical sets of
> > APIs to control the PASID.
> > 
> > Drivers consuming a PASID, like VDPA, should consume the PASID and do
> > nothing more than authorize the HW to use it.
> > 
> > qemu should have general code under the viommu driver that drives
> > /dev/ioasid to create PASID's and manage the IO mapping according to
> > the guest's needs.
> > 
> > Drivers like VDPA and VFIO should simply accept that PASID and
> > configure/authorize their HW to do DMA's with its tag.
> > 
> 
> I agree with you on consolidating things in one place (especially for the
> general SVA support). But here I was referring to an usage without 
> pgtable binding (Possibly Jason. W can say more here), where the 
> userspace just wants to allocate PASIDs, program/accept PASIDs to 
> various workqueues (device specific), and then use MAP/UNMAP 
> interface to manage address spaces associated with each PASID. 
> I just wanted to point out that the latter two steps are through 
> VFIO/VDPA specific interfaces. 

No, don't do that.

VFIO and VDPA have no business having map/unmap interfaces once we have
/dev/ioasid. That all belongs on the ioasid side.

I know they have those interfaces today, but that doesn't mean we have
to keep using them for PASID use cases. They should be replaced with a
'do dma from this pasid on /dev/ioasid' interface, certainly not a
'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
interface.

This is because PASID is *complicated* in the general case! For
instance all the two level stuff you are talking about must not leak
into every user!

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-05 23:35                                             ` Jason Gunthorpe
@ 2021-04-06  0:37                                               ` Tian, Kevin
  2021-04-06 12:15                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-06  0:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jean-Philippe Brucker, Jacob Pan, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 6, 2021 7:35 AM
> 
> On Fri, Apr 02, 2021 at 07:30:23AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, April 2, 2021 12:04 AM
> > >
> > > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > >
> > > > DMA page faults are delivered to root-complex via page request
> message
> > > and
> > > > it is per-device according to PCIe spec. Page request handling flow is:
> > > >
> > > > 1) iommu driver receives a page request from device
> > > > 2) iommu driver parses the page request message. Get the RID,PASID,
> > > faulted
> > > >    page and requested permissions etc.
> > > > 3) iommu driver triggers fault handler registered by device driver with
> > > >    iommu_report_device_fault()
> > >
> > > This seems confused.
> > >
> > > The PASID should define how to handle the page fault, not the driver.
> > >
> > > I don't remember any device specific actions in ATS, so what is the
> > > driver supposed to do?
> > >
> > > > 4) device driver's fault handler signals an event FD to notify userspace
> to
> > > >    fetch the information about the page fault. If it's VM case, inject the
> > > >    page fault to VM and let guest to solve it.
> > >
> > > If the PASID is set to 'report page fault to userspace' then some
> > > event should come out of /dev/ioasid, or be reported to a linked
> > > eventfd, or whatever.
> > >
> > > If the PASID is set to 'SVM' then the fault should be passed to
> > > handle_mm_fault
> > >
> > > And so on.
> > >
> > > Userspace chooses what happens based on how they configure the PASID
> > > through /dev/ioasid.
> > >
> > > Why would a device driver get involved here?
> > >
> > > > Eric has sent below series for the page fault reporting for VM with
> passthru
> > > > device.
> > > > https://lore.kernel.org/kvm/20210223210625.604517-5-
> > > eric.auger@redhat.com/
> > >
> > > It certainly should not be in vfio pci. Everything using a PASID needs
> > > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> > >
> >
> > This touches an interesting fact:
> >
> > The fault may be triggered in either 1st-level or 2nd-level page table,
> > when nested translation is enabled (in vSVA case). The 1st-level is bound
> > by the user space, which therefore needs to receive the fault event. The
> > 2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in
> > kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map
> > GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the
> > device fault handler, which then forward the event through /dev/ioasid
> > to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
> > centric fault reporting mechanism to separate handlers in each level,
> > i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
> > ioasid register handler for 1st-level fault?
> 
> This I'm struggling to understand. /dev/ioasid should handle all the
> faults cases, why would VFIO ever get involved in a fault? What would
> it even do?
> 
> If the fault needs to be fixed in the hypervisor then it is a kernel
> fault and it does handle_mm_fault. This absolutely should not be in
> VFIO or VDPA

With nested translation it is GVA->GPA->HPA. The kernel needs to
fix faults related to GPA->HPA (managed by VFIO/VDPA), while
handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
page fault is expected to be delivered to VFIO/VDPA first, which then
finds the HVA related to the GPA, calls handle_mm_fault to fix HVA->HPA,
and then calls iommu_map to fix GPA->HPA in the IOMMU page table.
This is exactly how a CPU EPT violation is handled.
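
The three steps above can be sketched as a toy user-space model. This is
only an illustration under stated assumptions: gpa_region, toy_fault_in and
fix_stage2_fault are hypothetical names, toy_fault_in merely stands in for
handle_mm_fault by inventing an HPA at a fixed offset, and the final store
stands in for iommu_map:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL

/* Hypothetical GPA->HVA region, similar in spirit to a VMM memory slot. */
struct gpa_region {
	uint64_t gpa_base;
	uint64_t hva_base;
	uint64_t size;
};

/* Toy 2nd-level entry recording one GPA page -> HPA page mapping. */
struct s2_entry {
	uint64_t gpa_page;
	uint64_t hpa_page;
	bool valid;
};

/* Step 1: find the HVA backing a faulting GPA. Returns false if unmapped. */
bool gpa_to_hva(const struct gpa_region *r, int n, uint64_t gpa, uint64_t *hva)
{
	assert(hva != NULL);
	for (int i = 0; i < n; i++) {
		if (gpa >= r[i].gpa_base && gpa - r[i].gpa_base < r[i].size) {
			*hva = r[i].hva_base + (gpa - r[i].gpa_base);
			return true;
		}
	}
	return false;
}

/* Step 2 stand-in for handle_mm_fault(): pretend the page is now resident
 * at an HPA that is a fixed offset from the HVA page (toy assumption). */
uint64_t toy_fault_in(uint64_t hva)
{
	return (hva & ~(TOY_PAGE_SIZE - 1)) + 0x100000000ULL;
}

/* Steps 1-3 chained; step 3 (the iommu_map stand-in) just fills *out. */
bool fix_stage2_fault(const struct gpa_region *r, int n, uint64_t gpa,
		      struct s2_entry *out)
{
	uint64_t hva;

	if (!gpa_to_hva(r, n, gpa, &hva))
		return false; /* no backing memory: the fault cannot be fixed */
	out->gpa_page = gpa & ~(TOY_PAGE_SIZE - 1);
	out->hpa_page = toy_fault_in(hva);
	out->valid = true;
	return true;
}
```

The sketch deliberately says nothing about which subsystem owns the region
table, since that ownership is the open question in this thread.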

> 
> If the fault needs to be fixed in the guest, then it needs to be
> delivered over /dev/ioasid in some way and injected into the
> vIOMMU. VFIO and VDPA have nothing to do with the vIOMMU driver in qemu.
> 
> You need to have an interface under /dev/ioasid to create both page
> table levels and part of that will be to tell the kernel what VA is
> mapped and how to handle faults.

VFIO/VDPA already have their own interface to manage GPA->HPA
mappings. Why do we want to duplicate it in /dev/ioasid? 

Thanks
Kevin

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-05 23:39                                           ` Jason Gunthorpe
@ 2021-04-06  1:02                                             ` Tian, Kevin
  2021-04-06 12:21                                               ` Jason Gunthorpe
       [not found]                                             ` <MWHPR11MB188628BDB37A4EE36F3D99338C769@MWHPR11MB1886.namprd11.prod.outlook.com>
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-06  1:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 6, 2021 7:40 AM
> 
> On Fri, Apr 02, 2021 at 07:58:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 1, 2021 9:47 PM
> > >
> > > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > [...]
> > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > ENQCMD, but that is not consistent with the industry. We need to
> see
> > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > >
> > > > > > > I'm not quite following here. Intel also allows PASID usage in guest
> without
> > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > ENQCMD.
> > > > >
> > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> and
> > > > > you can't really use a vPASID.
> > > >
> > > > This is a diagram shows the vSVA setup.
> > >
> > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > mappings.
> > >
> > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > a mix of physical devices and mdevs?
> > >
> >
> > We plan to support two schemes. One is vPASID identity-mapped to
> > pPASID then the mixed scenario just works, with the limitation of
> > lacking of live migration support. The other is non-identity-mapped
> > scheme, where live migration is supported but physical devices and
> > mdevs should not be mixed in one VM if both expose SVA capability
> > (requires some filtering check in Qemu).
> 
> That just becomes "block vPASID support if any device that
> doesn't use ENQCMD is plugged into the guest"

The limitation is only for physical devices, and in reality it is not that
bad. To support live migration with a physical device we need additional
work anyway to migrate the device state (e.g. based on Max's work), so
it's not unreasonable to also mediate guest programming of the
device-specific PASID register to enable vPASID (it needs to be translated
for the whole VM lifespan, but this is likely not a hot path).

> 
> Which needs a special VFIO capability of some kind so qemu knows to
> block it. This really needs to all be laid out together so someone
> can understand it :(

Or it could simply be based on whether the VFIO device supports live migration.

> 
> Why doesn't the siov cookbook explain this stuff??
> 
> > We hope the /dev/ioasid can support both schemes, with the minimal
> > requirement of allowing userspace to tag a vPASID to a pPASID and
> > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > the guest will always use pPASID.
> 
> What I'm unclear on is whether /dev/ioasid even needs to care about
> vPASID or if vPASID is just a hidden artifact of the KVM connection to
> setup the translation table and the vIOMMU driver in qemu.

Not just for KVM. Also required by mdev, which needs to translate
vPASID into pPASID when ENQCMD is not used. As I replied in another
mail, possibly we don't need /dev/ioasid to know this fact, which 
should only care about the operations related to pPASID. VFIO could
carry vPASID information to mdev. KVM should have its own interface
to know this information, as you suggested earlier.
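
A minimal sketch of the two vPASID schemes discussed here (identity-mapped
and non-identity-mapped), as a small lookup table that an mdev-like consumer
could use to translate vPASID into pPASID. The structure and function names
are hypothetical and the table size is arbitrary; nothing below is an
existing VFIO, KVM or /dev/ioasid interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VPASID_MAX 16            /* arbitrary toy limit */
#define PASID_INVALID UINT32_MAX /* marks an untagged vPASID */

struct vpasid_table {
	bool identity;               /* scheme 1: vPASID must equal pPASID */
	uint32_t ppasid[VPASID_MAX]; /* scheme 2: per-VM vPASID -> pPASID map */
};

void vpasid_table_init(struct vpasid_table *t, bool identity)
{
	assert(t != NULL);
	t->identity = identity;
	for (int i = 0; i < VPASID_MAX; i++)
		t->ppasid[i] = PASID_INVALID;
}

/* Tag a vPASID to a pPASID; the identity scheme rejects unequal pairs. */
bool vpasid_tag(struct vpasid_table *t, uint32_t vpasid, uint32_t ppasid)
{
	assert(t != NULL);
	if (vpasid >= VPASID_MAX)
		return false;
	if (t->identity && vpasid != ppasid)
		return false;
	t->ppasid[vpasid] = ppasid;
	return true;
}

/* Translate as an mdev would when ENQCMD is not used. */
uint32_t vpasid_translate(const struct vpasid_table *t, uint32_t vpasid)
{
	assert(t != NULL);
	if (vpasid >= VPASID_MAX)
		return PASID_INVALID;
	return t->ppasid[vpasid];
}
```

Where such a table would live (VFIO, KVM, or elsewhere) is exactly what is
being debated, so the sketch takes no position on that.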

Thanks
Kevin

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-05 23:42                           ` Jason Gunthorpe
@ 2021-04-06  1:27                             ` Tian, Kevin
  2021-04-06 12:34                               ` Jason Gunthorpe
  2021-04-06  1:35                             ` Jason Wang
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-06  1:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 7:43 AM
> 
> On Fri, Apr 02, 2021 at 08:22:28AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, March 30, 2021 9:29 PM
> > >
> > > >
> > > > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > > > bound to specific security context (e.g. a control vq in vDPA) instead of
> > > > tying to mm. In this case there is no pgtable binding initiated from user
> > > > space. Instead, ioasid is allocated from /dev/ioasid and then
> programmed
> > > > to the intended security context through specific passthrough
> framework
> > > > which manages that context.
> > >
> > > This sounds like the exact opposite of what I'd like to see.
> > >
> > > I do not want to see every subsystem gaining APIs to program a
> > > PASID. All of that should be consolidated in *one place*.
> > >
> > > I do not want to see VDPA and VFIO have two nearly identical sets of
> > > APIs to control the PASID.
> > >
> > > Drivers consuming a PASID, like VDPA, should consume the PASID and do
> > > nothing more than authorize the HW to use it.
> > >
> > > qemu should have general code under the viommu driver that drives
> > > /dev/ioasid to create PASID's and manage the IO mapping according to
> > > the guest's needs.
> > >
> > > Drivers like VDPA and VFIO should simply accept that PASID and
> > > configure/authorize their HW to do DMA's with its tag.
> > >
> >
> > I agree with you on consolidating things in one place (especially for the
> > general SVA support). But here I was referring to an usage without
> > pgtable binding (Possibly Jason. W can say more here), where the
> > userspace just wants to allocate PASIDs, program/accept PASIDs to
> > various workqueues (device specific), and then use MAP/UNMAP
> > interface to manage address spaces associated with each PASID.
> > I just wanted to point out that the latter two steps are through
> > VFIO/VDPA specific interfaces.
> 
> No, don't do that.
> 
> VFIO and VDPA have no business having map/unmap interfaces once we have
> /dev/ioasid. That all belongs on the ioasid side.
> 
> I know they have those interfaces today, but that doesn't mean we have
> to keep using them for PASID use cases. They should be replaced with a
> 'do dma from this pasid on /dev/ioasid' interface, certainly not a
> 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> interface.
> 
> This is because PASID is *complicated* in the general case! For
> instance all the two level stuff you are talking about must not leak
> into every user!
> 

Hi, Jason,

I didn't get your last comment about how the two-level stuff is leaked into
every user. Could you elaborate a bit?

Here is one example of why using the existing VFIO/VDPA interface makes
sense. Say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO
container. The container is associated with an iommu domain which contains
a single 2nd-level page table, shared by both devices (when attached to
the domain). The VFIO MAP operation is applied to the 2nd-level page
table and thus naturally applies to both devices. Userspace could then use
/dev/ioasid to further allocate IOASIDs and bind multiple 1st-level page
tables for dev1, nested on the shared 2nd-level page table.

If we follow your suggestion, must VFIO then deny VFIO MAP operations
on sva1 (assuming userspace should not mix sva1 and sva2 in the same
container and should instead use /dev/ioasid to map for sva1)? And even for
an sva-capable device there is a window before the guest actually enables
sva on that device, so should VFIO still accept MAP in that window
and then deny it after sva is enabled by the guest? This all sounds
unnecessarily complex while there is already a clean way to achieve it...

Thanks
Kevin

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-05 23:42                           ` Jason Gunthorpe
  2021-04-06  1:27                             ` Tian, Kevin
@ 2021-04-06  1:35                             ` Jason Wang
  2021-04-06 12:42                               ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Wang @ 2021-04-06  1:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Liu, Yi L, Wu, Hao,
	Jiang, Dave


On 2021/4/6 7:42 AM, Jason Gunthorpe wrote:
> On Fri, Apr 02, 2021 at 08:22:28AM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Tuesday, March 30, 2021 9:29 PM
>>>
>>>> First, userspace may use ioasid in a non-SVA scenario where ioasid is
>>>> bound to specific security context (e.g. a control vq in vDPA) instead of
>>>> tying to mm. In this case there is no pgtable binding initiated from user
>>>> space. Instead, ioasid is allocated from /dev/ioasid and then programmed
>>>> to the intended security context through specific passthrough framework
>>>> which manages that context.
>>> This sounds like the exact opposite of what I'd like to see.
>>>
>>> I do not want to see every subsystem gaining APIs to program a
>>> PASID. All of that should be consolidated in *one place*.
>>>
>>> I do not want to see VDPA and VFIO have two nearly identical sets of
>>> APIs to control the PASID.
>>>
>>> Drivers consuming a PASID, like VDPA, should consume the PASID and do
>>> nothing more than authorize the HW to use it.
>>>
>>> qemu should have general code under the viommu driver that drives
>>> /dev/ioasid to create PASID's and manage the IO mapping according to
>>> the guest's needs.
>>>
>>> Drivers like VDPA and VFIO should simply accept that PASID and
>>> configure/authorize their HW to do DMA's with its tag.
>>>
>> I agree with you on consolidating things in one place (especially for the
>> general SVA support). But here I was referring to an usage without
>> pgtable binding (Possibly Jason. W can say more here), where the
>> userspace just wants to allocate PASIDs, program/accept PASIDs to
>> various workqueues (device specific), and then use MAP/UNMAP
>> interface to manage address spaces associated with each PASID.
>> I just wanted to point out that the latter two steps are through
>> VFIO/VDPA specific interfaces.
> No, don't do that.
>
> VFIO and VDPA have no business having map/unmap interfaces once we have
> /dev/ioasid. That all belongs on the ioasid side.
>
> I know they have those interfaces today, but that doesn't mean we have
> to keep using them for PASID use cases. They should be replaced with a
> 'do dma from this pasid on /dev/ioasid' interface, certainly not a
> 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> interface.


So it looks like the PASID was bound to SVA in this design. I think that's
not necessarily the case:

1) PASID can be implemented without SVA, in which case a map/unmap
interface is still required
2) For the case where the hypervisor wants to do some mediation in the
middle for a virtqueue, e.g. in the case of a control vq that is implemented
in the VF/ADI/SF itself, the hardware virtqueue needs to be controlled by
Qemu. Though binding qemu's page table to the cvq can work, it looks like
overkill; a small dedicated buffer that is mapped for this PASID
seems more suitable.


>
> This is because PASID is *complicated* in the general case! For
> instance all the two level stuff you are talking about must not leak
> into every user!
>
> Jason


So do you mean the device should not expose the PASID configuration API
to the guest? I think it could happen if we assign the whole device and let
the guest configure it for nested VMs.

Thanks


>


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
       [not found]                                             ` <MWHPR11MB188628BDB37A4EE36F3D99338C769@MWHPR11MB1886.namprd11.prod.outlook.com>
@ 2021-04-06  2:08                                               ` Tian, Kevin
  0 siblings, 0 replies; 269+ messages in thread
From: Tian, Kevin @ 2021-04-06  2:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Tian, Kevin
> Sent: Tuesday, April 6, 2021 9:02 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 6, 2021 7:40 AM
> >
> > On Fri, Apr 02, 2021 at 07:58:02AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need
> to
> > see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > > I'm not quite following here. Intel also allows PASID usage in guest
> > without
> > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > > ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > > >
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID then the mixed scenario just works, with the limitation of
> > > lacking of live migration support. The other is non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> >
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
> 
> The limitation is only for physical device. and in reality it is not that
> bad. To support live migration with physical device we anyway need
> additional work to migrate the device state (e.g. based on Max's work),
> then it's not unreasonable to also mediate guest programming of
> device specific PASID register to enable vPASID (need to translate in
> the whole VM lifespan but likely is not a hot path).
> 
> >
> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be laid out together so someone
> > can understand it :(
> 
> Or it could simply be based on whether the VFIO device supports live migration.

Actually you are right on this point. VFIO should provide a per-device
capability to indicate whether vPASID is allowed on this device: likely
yes for mdev, and by default no for pdev (unless it explicitly opts in).
Qemu should enable vPASID only if all assigned devices support it, and
then provide vPASID information when using the VFIO API to allow pPASIDs.
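
To make the rule concrete, here is a minimal sketch of the check Qemu
would do, assuming a hypothetical per-device "vPASID allowed" flag. The
struct, field and function names are made up for illustration; no such
VFIO capability exists in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device record; not a real VFIO or Qemu structure. */
struct assigned_dev {
    const char *name;
    bool is_mdev;
    bool vpasid_cap;   /* assumed "vPASID allowed" capability bit */
};

/* Assumed default policy: yes for mdev, no for pdev unless it opts in. */
bool default_vpasid_cap(bool is_mdev, bool pdev_opt_in)
{
    return is_mdev || pdev_opt_in;
}

/*
 * vPASID may be enabled for the VM only if every assigned device
 * reports the capability. An empty device list yields true.
 */
bool vm_can_enable_vpasid(const struct assigned_dev *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].vpasid_cap)
            return false;
    }
    return true;
}
```

With one mdev and one pdev that has not opted in, the check fails and
the VM would fall back to the identity-mapped scheme.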

> 
> >
> > Why doesn't the siov cookbook explaining this stuff??
> >
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> >
> > What I'm a unclear of is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > setup the translation table and the vIOMMU driver in qemu.
> 
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used. As I replied in another
> mail, possibly we don't need /dev/ioasid to know this fact, which
> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev. KVM should have its own interface
> to know this information, as you suggested earlier.
> 
> Thanks
> Kevin


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06  0:37                                               ` Tian, Kevin
@ 2021-04-06 12:15                                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 12:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Jean-Philippe Brucker, Jacob Pan, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Tue, Apr 06, 2021 at 12:37:35AM +0000, Tian, Kevin wrote:

> With nested translation it is GVA->GPA->HPA. The kernel needs to
> fix fault related to GPA->HPA (managed by VFIO/VDPA) while 
> handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
> page fault is expected to be delivered to VFIO/VDPA first which then
> find HVA related to GPA, call handle_mm_fault to fix HVA->HPA,
> and then call iommu_map to fix GPA->HPA in the IOMMU page table.
> This is exactly like how CPU EPT violation is handled.

No, it should all be in the /dev/ioasid layer not duplicated into
every user.

> > If the fault needs to be fixed in the guest, then it needs to be
> > delivered over /dev/ioasid in some way and injected into the
> > vIOMMU. VFIO and VDPA have nothing to do with vIOMMU driver in quemu.
> > 
> > You need to have an interface under /dev/ioasid to create both page
> > table levels and part of that will be to tell the kernel what VA is
> > mapped and how to handle faults.
> 
> VFIO/VDPA already have their own interface to manage GPA->HPA
> mappings. Why do we want to duplicate it in /dev/ioasid? 

They have their own interface to manage other types of HW, we should
not duplicate PASID programming into there too.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06  1:02                                             ` Tian, Kevin
@ 2021-04-06 12:21                                               ` Jason Gunthorpe
  2021-04-07  2:23                                                 ` Tian, Kevin
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 12:21 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

On Tue, Apr 06, 2021 at 01:02:05AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 6, 2021 7:40 AM
> > 
> > On Fri, Apr 02, 2021 at 07:58:02AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need to
> > see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > I'm not quire flow here. Intel also allows PASID usage in guest
> > without
> > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > > ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > > >
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID then the mixed scenario just works, with the limitation of
> > > lacking of live migration support. The other is non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> > 
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
> 
> The limitation is only for physical device. and in reality it is not that
> bad. To support live migration with physical device we anyway need 
> additional work to migrate the device state (e.g. based on Max's work), 
> then it's not unreasonable to also mediate guest programming of 
> device specific PASID register to enable vPASID (need to translate in
> the whole VM lifespan but likely is not a hot path).

IMHO that is pretty unreasonable. More likely we end up with vPASID
tables in each migratable device, like KVM has.

> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be layed out together so someone
> > can understand it :(
> 
> Or could simply based on whether the VFIO device supports live migration.

You need to define affirmative caps that indicate that vPASID will be
supported by the VFIO device.

> > Why doesn't the siov cookbook explaining this stuff??
> > 
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> > 
> > What I'm a unclear of is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > setup the translation table and the vIOMMU driver in qemu.
> 
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used.

Do we have any mdevs that will do this?

> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev.

It depends how common this is, I suppose

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06  1:27                             ` Tian, Kevin
@ 2021-04-06 12:34                               ` Jason Gunthorpe
  2021-04-07  2:08                                 ` Tian, Kevin
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 12:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse, Jason Wang

On Tue, Apr 06, 2021 at 01:27:15AM +0000, Tian, Kevin wrote:
> 
> and here is one example why using existing VFIO/VDPA interface makes
> sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO 
> container. 

Forget about SVA, it is an irrelevant detail of how a PASID is
configured.

> The container is associated to an iommu domain which contains a
> single 2nd-level page table, shared by both devices (when attached
> to the domain).

This level should be described by an ioasid.

> The VFIO MAP operation is applied to the 2nd-level
> page table thus naturally applied to both devices. Then userspace
> could use /dev/ioasid to further allocate IOASIDs and bind multiple
> 1st-level page tables for dev1, nested on the shared 2nd-level page
> table.

Because if you don't then we enter an insane world where a PASID is being
created under /dev/ioasid but its translation path flows through setup
done by VFIO and the whole user API becomes an incomprehensible mess.

How will you even associate the PASID with the other translation??

The entire translation path for any ioasid or PASID should be defined
only by /dev/ioasid. Everything else is a legacy API.

> If following your suggestion then VFIO must deny VFIO MAP operations
> on sva1 (assume userspace should not mix sva1 and sva2 in the same
> container and instead use /dev/ioasid to map for sva1)? 

No, userspace creates an ioasid for the guest physical mapping and
passes this ioasid to VFIO PCI, which will assign it as the first layer
mapping on the RID.

When PASIDs are allocated the uAPI will be told to logically nest them
under the first ioasid. When VFIO authorizes a PASID for a RID it
checks that all the HW rules are being followed.

If there are rules like "groups of VFIO devices must always use the
same IOASID" then VFIO will check these too (and realistically qemu
will have only one guest physical map ioasid anyhow).

There is no real difference between setting up an IOMMU table for a
(RID,PASID) tuple or just a RID. We can do it universally with
one interface for all consumers.
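
As a rough sketch of that flow, assuming made-up types (nothing here is
an existing kernel or /dev/ioasid interface): one ioasid holds the guest
physical map and is attached to the bare RID, and a PASID is authorized
only if its ioasid is nested under that one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOASID_NONE UINT32_MAX  /* "no parent" / "no PASID" marker */

/* Hypothetical model of one I/O address space from /dev/ioasid. */
struct ioasid {
    uint32_t id;
    uint32_t parent;  /* IOASID_NONE for the guest physical (GPA) map */
};

/* Hypothetical record of which ioasid a RID or (RID,PASID) uses. */
struct attachment {
    uint16_t rid;
    uint32_t pasid;   /* IOASID_NONE: traffic without a PASID */
    uint32_t ioasid;
};

/* Attach the GPA ioasid to a bare RID. */
struct attachment attach_rid(uint16_t rid, const struct ioasid *gpa)
{
    assert(gpa != NULL && gpa->parent == IOASID_NONE);
    struct attachment a = { .rid = rid, .pasid = IOASID_NONE,
                            .ioasid = gpa->id };
    return a;
}

/*
 * Authorize a PASID on a RID. The single example rule checked here is
 * that the nested ioasid sits under the ioasid attached to that RID.
 */
bool authorize_pasid(const struct attachment *rid_att,
                     const struct ioasid *nested, uint32_t pasid,
                     struct attachment *out)
{
    assert(rid_att != NULL && nested != NULL && out != NULL);
    if (rid_att->pasid != IOASID_NONE ||
        nested->parent != rid_att->ioasid)
        return false;
    out->rid = rid_att->rid;
    out->pasid = pasid;
    out->ioasid = nested->id;
    return true;
}
```

Both attachments end up as the same kind of record, which is the point
about a RID and a (RID,PASID) tuple not being fundamentally different.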

I wanted this when we were doing VDPA for the first time, now that we
are doing pasid and more difficult stuff I view it as essential.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06  1:35                             ` Jason Wang
@ 2021-04-06 12:42                               ` Jason Gunthorpe
  2021-04-07  2:06                                 ` Jason Wang
  2021-04-07  8:17                                 ` Tian, Kevin
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 12:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu,
	Yi L, Wu, Hao, Jiang, Dave

On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:

> > VFIO and VDPA has no buisness having map/unmap interfaces once we have
> > /dev/ioasid. That all belongs in the iosaid side.
> > 
> > I know they have those interfaces today, but that doesn't mean we have
> > to keep using them for PASID use cases, they should be replaced with a
> > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
> > interface
>  
> So it looks like the PASID was bound to SVA in this design. I think it's not
> necessairly the case:

No, I wish people would stop talking about SVA.

SVA and vSVA are a very special narrow configuration of a PASID. There
are lots of other PASID configurations! That is the whole point: a
PASID is complicated, there are many configuration scenarios, and they
need to be in one place with a very clearly defined uAPI.

> 1) PASID can be implemented without SVA, in this case a map/unmap interface
> is still required

Any interface to manipulate a PASID should be under /dev/ioasid. We do
not want to duplicate this into every subsystem.

> 2) For the case that hypervisor want to do some mediation in the middle for
> a virtqueue. e.g in the case of control vq that is implemented in the
> VF/ADI/SF itself, the hardware virtqueue needs to be controlled by Qemu,
> Though binding qemu's page table to cvq can work but it looks like a
> overkill, a small dedicated buffers that is mapped for this PASID seems more
> suitalbe.

/dev/ioasid should allow userspace to set up any PASID configuration it
wants. There are many choices. That is the whole point: instead of
copying & pasting all the PASID configuration options into every
subsystem, we have one place to configure it.

If you want a PASID (or generic ioasid) that has the guest physical
map, which is probably all that VDPA would ever want, then /dev/ioasid
should be able to prepare that.

If you just want to map a few buffers into a PASID then it should be
able to do that too.

> So do you mean the device should not expose the PASID confiugration API to
> guest? I think it could happen if we assign the whole device and let guest
> to configure it for nested VMs.

This always needs cooperation with the vIOMMU driver. We can't have
nested PASID use without both parts working together.

The vIOMMU driver configures the PASID and assigns the mappings
(however complicated that turns out to be).

The VDPA/mdev driver authorizes the HW to use the ioasid mapping, e.g.
by authorizing a queue to issue PCIe TLPs with a specific PASID.

The authorization is triggered by the guest telling the vIOMMU to
allow a vRID to talk to a PASID, which qemu would have to translate to
telling something like the VDPA driver under the vRID that it can use
a PASID from /dev/ioasid.

For security a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU
has not authorized its vRID to use. Otherwise the security model of
something like VFIO in the guest becomes completely broken.
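
A minimal sketch of that rule, assuming a hypothetical per-VM table of
(vRID, PASID) pairs that the vIOMMU has authorized. The table layout
and function names are invented for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_AUTH 16

/* One authorization granted by the vIOMMU: this vRID may use this PASID. */
struct pasid_auth {
    uint16_t vrid;
    uint32_t pasid;
    bool used;
};

struct auth_table {
    struct pasid_auth slot[MAX_AUTH];
};

void auth_init(struct auth_table *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_AUTH; i++)
        t->slot[i].used = false;
}

/* The driver calls this before letting the HW issue TLPs with a PASID. */
bool auth_check(const struct auth_table *t, uint16_t vrid, uint32_t pasid)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_AUTH; i++) {
        if (t->slot[i].used && t->slot[i].vrid == vrid &&
            t->slot[i].pasid == pasid)
            return true;
    }
    return false;
}

/* Record an authorization; returns false if the table is full. */
bool auth_allow(struct auth_table *t, uint16_t vrid, uint32_t pasid)
{
    assert(t != NULL);
    if (auth_check(t, vrid, pasid))
        return true;
    for (size_t i = 0; i < MAX_AUTH; i++) {
        if (!t->slot[i].used) {
            t->slot[i].vrid = vrid;
            t->slot[i].pasid = pasid;
            t->slot[i].used = true;
            return true;
        }
    }
    return false;
}
```

The default answer is "deny": a PASID is usable by a vRID only after an
explicit auth_allow() for exactly that pair.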

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06 12:42                               ` Jason Gunthorpe
@ 2021-04-07  2:06                                 ` Jason Wang
  2021-04-07  8:17                                 ` Tian, Kevin
  1 sibling, 0 replies; 269+ messages in thread
From: Jason Wang @ 2021-04-07  2:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jacob Pan, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Eric Auger, Jonathan Corbet, Raj, Ashok, Liu,
	Yi L, Wu, Hao, Jiang, Dave


On 2021/4/6 8:42 PM, Jason Gunthorpe wrote:
> On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
>
>>> VFIO and VDPA has no buisness having map/unmap interfaces once we have
>>> /dev/ioasid. That all belongs in the iosaid side.
>>>
>>> I know they have those interfaces today, but that doesn't mean we have
>>> to keep using them for PASID use cases, they should be replaced with a
>>> 'do dma from this pasid on /dev/ioasid' interface certainly not a
>>> 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
>>> interface
>>   
>> So it looks like the PASID was bound to SVA in this design. I think it's not
>> necessairly the case:
> No, I wish people would stop talking about SVA.
>
> SVA and vSVA are a very special narrow configuration of a PASID. There
> are lots of other PASID configurations! That is the whole point, a
> PASID is complicated, there are many configuration scenarios, they
> need to be in one place with a very clearly defined uAPI


Right, that's my understanding as well.


>
>> 1) PASID can be implemented without SVA, in this case a map/unmap interface
>> is still required
> Any interface to manipulate a PASID should be under /dev/ioasid. We do
> not want to duplicate this into every subsystem.


Yes.


>
>> 2) For the case that hypervisor want to do some mediation in the middle for
>> a virtqueue. e.g in the case of control vq that is implemented in the
>> VF/ADI/SF itself, the hardware virtqueue needs to be controlled by Qemu,
>> Though binding qemu's page table to cvq can work but it looks like a
>> overkill, a small dedicated buffers that is mapped for this PASID seems more
>> suitalbe.
> /dev/ioasid should allow userspace to setup any PASID configuration it
> wants. There are many choices. That is the whole point, instead of
> copying&pasting all the PASID configuration option into every
> subsystem we have on place to configure it.
>
> If you want a PASID (or generic ioasid) that has the guest physical
> map, which is probably all that VDPA would ever want, then /dev/ioasid
> should be able to prepare that.
>
> If you just want to map a few buffers into a PASID then it should be
> able to do that too.
>
>> So do you mean the device should not expose the PASID confiugration API to
>> guest? I think it could happen if we assign the whole device and let guest
>> to configure it for nested VMs.
> This always needs co-operating with the vIOMMU driver. We can't have
> nested PASID use without both parts working together.
>
> The vIOMMU driver configures the PASID and assigns the mappings
> (however complicated that turns out to be)
>
> The VDPA/mdev driver authorizes the HW to use the ioasid mapping, eg
> by authorizing a queue to issue PCIe TLPs with a specific PASID.
>
> The authorization is triggered by the guest telling the vIOMMU to
> allow a vRID to talk to a PASID, which qemu would have to translate to
> telling something like the VDPA driver under the vRID that it can use
> a PASID from /dev/ioasid
>
> For security a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU
> has not authorized its vRID to use. Otherwise the security model of
> something like VFIO in the guest becomes completely broken.


Yes, that's how it should work.

Thanks


>
> Jason
>



* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06 12:34                               ` Jason Gunthorpe
@ 2021-04-07  2:08                                 ` Tian, Kevin
  2021-04-07 12:20                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-07  2:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 6, 2021 8:35 PM
> 
> On Tue, Apr 06, 2021 at 01:27:15AM +0000, Tian, Kevin wrote:
> >
> > and here is one example why using existing VFIO/VDPA interface makes
> > sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO
> > container.
> 
> Forget about SVA, it is an irrelevant detail of how a PASID is
> configured.
> 
> > The container is associated to an iommu domain which contains a
> > single 2nd-level page table, shared by both devices (when attached
> > to the domain).
> 
> This level should be described by an ioasid.
> 
> > The VFIO MAP operation is applied to the 2nd-level
> > page table thus naturally applied to both devices. Then userspace
> > could use /dev/ioasid to further allocate IOASIDs and bind multiple
> > 1st-level page tables for dev1, nested on the shared 2nd-level page
> > table.
> 
> Because if you don't then we enter insane world where a PASID is being
> created under /dev/ioasid but its translation path flows through setup
> done by VFIO and the whole user API becomes an incomprehensible mess.
> 
> How will you even associate the PASID with the other translation??

The PASID is attached to a specific iommu domain (created by VFIO/VDPA),
which already has GPA->HPA mappings configured. If we view that mapping
as an attribute of the iommu domain, it's reasonable for the pgtable that
userspace binds through /dev/ioasid to nest on it.


> 
> The entire translation path for any ioasid or PASID should be defined
> only by /dev/ioasid. Everything else is a legacy API.
> 
> > If following your suggestion then VFIO must deny VFIO MAP operations
> > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > container and instead use /dev/ioasid to map for sva1)?
> 
> No, userspace creates an iosaid for the guest physical mapping and
> passes this ioasid to VFIO PCI which will assign it as the first layer
> mapping on the RID

Is it a dummy ioasid just for providing GPA mappings for the nesting
purposes of other IOASIDs? Then we waste one per VM?

> 
> When PASIDs are allocated the uAPI will be told to logically nested
> under the first ioasid. When VFIO authorizes a PASID for a RID it
> checks that all the HW rules are being followed.

As I explained above, why can't we just use the iommu domain to connect
the dots? Every passthrough framework needs to create an iommu domain
first, and it needs to support both devices w/ PASID and devices w/o PASID.
For devices w/o PASID it needs to invent its own MAP interface anyway.
Then why do we bother creating another MAP interface through /dev/ioasid,
which not only duplicates the existing one but also creates a transition
burden between two sets of MAP interfaces when the guest turns the pasid
capability on the device on or off?

> 
> If there are rules like groups of VFIO devices must always use the
> same IOASID then VFIO will check these too (and realistically qemu
> will have only one guest physical map ioasid anyhow)
> 
> There is no real difference between setting up an IOMMU table for a
> (RID,PASID) tuple or just a RID. We can do it universally with
> one interface for all consumers.
> 

'Universally' depends on the angle from which you look at this problem.
From the IOASID p.o.v. possibly yes, but from the device passthrough p.o.v.
it's the opposite: the passthrough framework needs to handle devices w/o
PASID anyway (and even a device w/ PASID could send traffic w/o PASID),
thus 'universally' makes more sense if the passthrough framework can use
one interface of its own to manage GPA mappings for all consumers (which
also applies when a PASID is allowed/authorized).

Thanks
Kevin


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06 12:21                                               ` Jason Gunthorpe
@ 2021-04-07  2:23                                                 ` Tian, Kevin
  0 siblings, 0 replies; 269+ messages in thread
From: Tian, Kevin @ 2021-04-07  2:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Jean-Philippe Brucker, LKML, Joerg Roedel,
	Lu Baolu, David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Alex Williamson,
	Eric Auger, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 6, 2021 8:21 PM
> 
> On Tue, Apr 06, 2021 at 01:02:05AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 6, 2021 7:40 AM
> > >
> > > On Fri, Apr 02, 2021 at 07:58:02AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > > >
> > > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > > [...]
> > > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > > ENQCMD, but that is not consistent with the industry. We need
> to
> > > see
> > > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > > >
> > > > > > > > I'm not quire flow here. Intel also allows PASID usage in guest
> > > without
> > > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it
> without
> > > > > > > ENQCMD.
> > > > > > >
> > > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > > and
> > > > > > > you can't really use a vPASID.
> > > > > >
> > > > > > This is a diagram shows the vSVA setup.
> > > > >
> > > > > I'm not talking only about vSVA. Generic PASID support with arbitary
> > > > > mappings.
> > > > >
> > > > > And how do you deal with the vPASID vs pPASID issue if the system
> has
> > > > > a mix of physical devices and mdevs?
> > > > >
> > > >
> > > > We plan to support two schemes. One is vPASID identity-mapped to
> > > > pPASID then the mixed scenario just works, with the limitation of
> > > > lacking of live migration support. The other is non-identity-mapped
> > > > scheme, where live migration is supported but physical devices and
> > > > mdevs should not be mixed in one VM if both expose SVA capability
> > > > (requires some filtering check in Qemu).
> > >
> > > That just becomes "block vPASID support if any device that
> > > doesn't use ENQCMD is plugged into the guest"
> >
> > The limitation is only for physical device. and in reality it is not that
> > bad. To support live migration with physical device we anyway need
> > additional work to migrate the device state (e.g. based on Max's work),
> > then it's not unreasonable to also mediate guest programming of
> > device specific PASID register to enable vPASID (need to translate in
> > the whole VM lifespan but likely is not a hot path).
> 
> IMHO that is pretty unreasonable.. More likely we end up with vPASID
> tables in each migratable device like KVM has.

Just like mdev needs to maintain an allowed PASID list, this extends it to
all migratable devices.

> 
> > > Which needs a special VFIO capability of some kind so qemu knows to
> > > block it. This really needs to all be layed out together so someone
> > > can understand it :(
> >
> > Or could simply based on whether the VFIO device supports live migration.
> 
> You need to define affirmative caps that indicate that vPASID will be
> supported by the VFIO device.

Yes, this is required as I acked in another mail.

> 
> > > Why doesn't the siov cookbook explaining this stuff??
> > >
> > > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > > the guest will always use pPASID.
> > >
> > > What I'm a unclear of is if /dev/ioasid even needs to care about
> > > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > > setup the translation table and the vIOMMU driver in qemu.
> >
> > Not just for KVM. Also required by mdev, which needs to translate
> > vPASID into pPASID when ENQCMD is not used.
> 
> Do we have any mdev's that will do this?

Definitely. Actually any mdev which doesn't do ENQCMD needs to do this.
In the normal case, the PASID is programmed into an MMIO register (or an
in-memory context) associated with the backend resource of the mdev. The
value programmed from the guest is a vPASID, and thus must be translated
into a pPASID before updating the physical register.
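
A minimal sketch of that trap-and-translate path, assuming a small
per-mdev vPASID->pPASID table filled in when userspace tags a vPASID to
a pPASID. The structure and function names are hypothetical, and
hw_pasid_reg merely stands in for the physical register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VPASID_MAP_SIZE 8

struct vpasid_entry {
    uint32_t vpasid;
    uint32_t ppasid;
    bool valid;
};

/* Hypothetical per-mdev state. */
struct mdev_state {
    struct vpasid_entry map[VPASID_MAP_SIZE];
    uint32_t hw_pasid_reg;   /* stand-in for the physical PASID register */
};

void mdev_init(struct mdev_state *m)
{
    assert(m != NULL);
    for (size_t i = 0; i < VPASID_MAP_SIZE; i++)
        m->map[i].valid = false;
    m->hw_pasid_reg = 0;
}

/* Record (or update) a vPASID -> pPASID tag; false if the table is full. */
bool mdev_tag_vpasid(struct mdev_state *m, uint32_t vpasid, uint32_t ppasid)
{
    assert(m != NULL);
    struct vpasid_entry *free_slot = NULL;
    for (size_t i = 0; i < VPASID_MAP_SIZE; i++) {
        if (m->map[i].valid && m->map[i].vpasid == vpasid) {
            m->map[i].ppasid = ppasid;
            return true;
        }
        if (!m->map[i].valid && free_slot == NULL)
            free_slot = &m->map[i];
    }
    if (free_slot == NULL)
        return false;
    free_slot->vpasid = vpasid;
    free_slot->ppasid = ppasid;
    free_slot->valid = true;
    return true;
}

/*
 * Handle a trapped guest write to the PASID register: translate the
 * guest value (a vPASID) and program the pPASID. An untagged vPASID is
 * rejected and the register is left untouched.
 */
bool mdev_write_pasid_reg(struct mdev_state *m, uint32_t guest_value)
{
    assert(m != NULL);
    for (size_t i = 0; i < VPASID_MAP_SIZE; i++) {
        if (m->map[i].valid && m->map[i].vpasid == guest_value) {
            m->hw_pasid_reg = m->map[i].ppasid;
            return true;
        }
    }
    return false;
}
```

The same table doubles as the allowed PASID list: a guest value with no
tag never reaches the hardware.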

> 
> > should only care about the operations related to pPASID. VFIO could
> > carry vPASID information to mdev.
> 
> It depends how common this is, I suppose
> 

Based on the above, I think it's a common case.

Thanks
Kevin


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-06 12:42                               ` Jason Gunthorpe
  2021-04-07  2:06                                 ` Jason Wang
@ 2021-04-07  8:17                                 ` Tian, Kevin
  2021-04-07 11:58                                   ` Jason Gunthorpe
  2021-04-07 18:43                                   ` Jean-Philippe Brucker
  1 sibling, 2 replies; 269+ messages in thread
From: Tian, Kevin @ 2021-04-07  8:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Jason Wang
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse

> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 8:43 PM
> 
> On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> 
> > > VFIO and VDPA has no buisness having map/unmap interfaces once we
> have
> > > /dev/ioasid. That all belongs in the iosaid side.
> > >
> > > I know they have those interfaces today, but that doesn't mean we have
> > > to keep using them for PASID use cases, they should be replaced with a
> > > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > > 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
> > > interface
> >
> > So it looks like the PASID was bound to SVA in this design. I think it's not
> > necessairly the case:
> 
> No, I wish people would stop talking about SVA.
> 
> SVA and vSVA are a very special narrow configuration of a PASID. There
> are lots of other PASID configurations! That is the whole point, a
> PASID is complicated, there are many configuration scenarios, they
> need to be in one place with a very clearly defined uAPI
> 

I feel it also makes sense to allow a subsystem to specify which configurations
are permitted when allowing a PASID on its device, e.g. excluding things like
GPA mappings that existing subsystems (VFIO/VDPA) already handle well:

- Share GPA mappings between multiple devices (w/ or w/o PASID) for 
better IOTLB efficiency;

- Share GPA mappings between transactions w/ PASID and transactions w/o
PASID from the same device (e.g. GPU) for better IOTLB efficiency;

- Use the same page table for GPA mappings before and after the guest 
turns on/off the PASID capability;

All of the above hold as long as we continue to let VFIO/VDPA manage the
iommu domain and the associated GPA mappings for PASID. The IOMMU driver
already ensures that a nested PASID entry links to the established GPA
paging structure of the domain when the 1st-level pgtable is bound through
/dev/ioasid.

In contrast, the above merits are lost if we force a model where GPA
mappings for PASID must be constructed through /dev/ioasid, as this will
lead to multiple paging structures for the same GPA mappings, implying
worse IOTLB usage and unnecessary invalidation cost.

Therefore, I envision a scheme where the subsystem could specify 
permitted PASID configurations when doing ALLOW_PASID, and then 
userspace queries per-PASID capability to learn which operations
are allowed, e.g.:

1) To enable vSVA, VFIO/VDPA allows pgtable binding and related invalidation/
fault ops through /dev/ioasid;

2) For vDPA control vq usage, no configuration is allowed through /dev/ioasid;

3) For a new subsystem which doesn't carry any legacy or usage similar to
VFIO/VDPA, it could permit all configurations through /dev/ioasid, including
1st-level binding and 2nd-level mapping ops;
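
The three cases above can be sketched as a per-PASID bitmask that the
subsystem supplies at ALLOW_PASID time and userspace later queries. The
enum values and function names below are assumptions for illustration,
not a proposed uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operations a PASID may be configured with via /dev/ioasid. */
enum pasid_op {
    PASID_OP_BIND_PGTABLE = 1u << 0,  /* bind a 1st-level page table */
    PASID_OP_INVALIDATE   = 1u << 1,  /* cache invalidation */
    PASID_OP_FAULT        = 1u << 2,  /* fault reporting/response */
    PASID_OP_MAP_2ND      = 1u << 3,  /* 2nd-level map/unmap */
};

#define PASID_OPS_ALL (PASID_OP_BIND_PGTABLE | PASID_OP_INVALIDATE | \
                       PASID_OP_FAULT | PASID_OP_MAP_2ND)

/* The three usages discussed above. */
enum pasid_usage {
    PASID_USAGE_VSVA,        /* case 1: vSVA on VFIO/VDPA */
    PASID_USAGE_VDPA_CVQ,    /* case 2: vDPA control vq */
    PASID_USAGE_NEW_SUBSYS,  /* case 3: subsystem without legacy MAP */
};

/* What the subsystem would permit when allowing a PASID on its device. */
uint32_t pasid_permitted_ops(enum pasid_usage usage)
{
    switch (usage) {
    case PASID_USAGE_VSVA:
        return PASID_OP_BIND_PGTABLE | PASID_OP_INVALIDATE | PASID_OP_FAULT;
    case PASID_USAGE_VDPA_CVQ:
        return 0;
    case PASID_USAGE_NEW_SUBSYS:
        return PASID_OPS_ALL;
    }
    return 0;
}

/* Per-PASID capability query: is this single op permitted? */
bool pasid_op_allowed(uint32_t permitted, enum pasid_op op)
{
    uint32_t bit = (uint32_t)op;
    assert(bit != 0 && (bit & (bit - 1)) == 0);  /* exactly one op */
    return (permitted & bit) != 0;
}
```

In case 1 the 2nd-level map op stays with VFIO/VDPA, so it is simply
absent from the mask that /dev/ioasid reports.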

This approach also allows us to grow the uAPI in stages. For now we
focus on 1) and 2), as VFIO/VDPA are the only two users and have a good
legacy interface to cover the GPA mappings. More ops can be introduced
for 3) when there is a real example to show what exact ops are required
for such a new subsystem.

Is this a good strategy to move forward?

btw this discussion was raised when discussing the I/O page fault handling
process. Currently the IOMMU layer implements a per-device fault reporting
mechanism, which requires VFIO to register a handler to receive all faults
on its device and then forward them to ioasid if they are due to the
1st-level. Possibly it makes more sense to convert it into a per-pgtable
reporting scheme, where the owner of each pgtable registers its own
handler. It means for 1) VFIO will register a 2nd-level pgtable handler
while /dev/ioasid will register a 1st-level pgtable handler, while for 3)
/dev/ioasid will register handlers for both the 1st-level and 2nd-level
pgtables. Jean? I also want to know your thoughts...

Thanks
Kevin


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07  8:17                                 ` Tian, Kevin
@ 2021-04-07 11:58                                   ` Jason Gunthorpe
  2021-04-07 18:43                                   ` Jean-Philippe Brucker
  1 sibling, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-07 11:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse

On Wed, Apr 07, 2021 at 08:17:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Tuesday, April 6, 2021 8:43 PM
> > 
> > On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> > 
> > > > VFIO and VDPA have no business having map/unmap interfaces once we have
> > > > /dev/ioasid. That all belongs on the ioasid side.
> > > >
> > > > I know they have those interfaces today, but that doesn't mean we have
> > > > to keep using them for PASID use cases, they should be replaced with a
> > > > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > > > 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> > > > interface
> > >
> > > So it looks like the PASID was bound to SVA in this design. I think it's not
> > > necessarily the case:
> > 
> > No, I wish people would stop talking about SVA.
> > 
> > SVA and vSVA are a very special narrow configuration of a PASID. There
> > are lots of other PASID configurations! That is the whole point, a
> > PASID is complicated, there are many configuration scenarios, they
> > need to be in one place with a very clearly defined uAPI
> > 
> 
> I feel it also makes sense to allow a subsystem to specify which configurations
> are permitted when allowing a PASID on its device

huh? why?

> e.g. excluding things like
> GPA mappings that existing subsystems (VFIO/VDPA) already handle well:

They don't "handle well", they have some historical baggage that is no
longer suitable for the complexity this area has in the modern world.

Forget about the existing APIs and replace them in /dev/ioasid.

> - Share GPA mappings between multiple devices (w/ or w/o PASID) for 
> better IOTLB efficiency;
>
> - Share GPA mappings between transactions w/ PASID and transactions w/o
> PASID from the same device (e.g. GPU) for better IOTLB efficiency;
> 
> - Use the same page table for GPA mappings before and after the guest 
> turns on/off the PASID capability;

All of these are cases you need to design the /dev/ioasid to handle.

It is pretty clear to me that you'll need non-PASID IOASIDs as
well.

Ideally a generic IOASID would just be a page table, and it wouldn't
crystallize into a RID or RID,PASID routing until devices are attached
to it.
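
A minimal sketch of that idea, under the assumption that an IOASID is
nothing more than a set of mappings plus an optional parent for nesting.
The Ioasid class and its methods are hypothetical names for illustration,
not a proposed uAPI:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ioasid:
    """Toy IOASID: a page table with an optional parent for nesting."""
    parent: Optional["Ioasid"] = None
    mappings: dict = field(default_factory=dict)      # iova -> output address
    attachments: list = field(default_factory=list)   # (rid, pasid or None)

    def map(self, iova, out):
        self.mappings[iova] = out

    def attach(self, rid, pasid=None):
        # Routing is decided only here: RID alone, or RID plus PASID.
        self.attachments.append((rid, pasid))

    def translate(self, iova):
        # Walk this level, then every parent level (nested translation).
        out = self.mappings[iova]
        return self.parent.translate(out) if self.parent else out
```

Here a GPA table would be a plain Ioasid attached by RID only, and a guest
page table would be a second Ioasid with parent set to the GPA table and
attached by RID,PASID. Both levels are visible as the same kind of object.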

Since IOASID can be nested the only thing that makes any sense is for
each level of the nest to be visible under /dev/ioasid. 

What a complete mess it would be if vfio-pci owns the GPA table,
/dev/ioasid has a nested PASID, and vfio-mdev is running a mdev on top
of that PASID.

> All above are given as long as we continue to let VFIO/VDPA manage the
> iommu domain and associated GPA mappings for PASID.

So don't do that. Don't I keep saying this weird split is making a
horrible mess?

You can't reasonably build the complex PASID scenarios you talk about
above unless the entire translation path is owned by one entity:
/dev/ioasid.

You need to focus on figuring out what that looks like then figure out
how to move VDPA and VFIO to consume /dev/ioasid for all of their
translation instead of open-coding half-baked internal versions.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07  2:08                                 ` Tian, Kevin
@ 2021-04-07 12:20                                   ` Jason Gunthorpe
  2021-04-07 23:50                                     ` Tian, Kevin
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-07 12:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse, Jason Wang

On Wed, Apr 07, 2021 at 02:08:33AM +0000, Tian, Kevin wrote:

> > Because if you don't then we enter insane world where a PASID is being
> > created under /dev/ioasid but its translation path flows through setup
> > done by VFIO and the whole user API becomes an incomprehensible mess.
> > 
> > How will you even associate the PASID with the other translation??
> 
> PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
> has GPA->HPA mappings already configured. If we view that mapping as an
> attribute of the iommu domain, it's reasonable to have the userspace-bound
> pgtable through /dev/ioasid to nest on it.

A user controlled page table should absolutely not be an attribute of
a hidden kernel object, nor should two parts of the kernel silently
connect to each other via hidden internal objects like this.

Security is important - this kind of connection must use some explicit
FD authorization to access shared objects, not be made implicit!

IMHO this direction is a dead end for this reason.

> > The entire translation path for any ioasid or PASID should be defined
> > only by /dev/ioasid. Everything else is a legacy API.
> > 
> > > If following your suggestion then VFIO must deny VFIO MAP operations
> > > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > > container and instead use /dev/ioasid to map for sva1)?
> > 
> > No, userspace creates an iosaid for the guest physical mapping and
> > passes this ioasid to VFIO PCI which will assign it as the first layer
> > mapping on the RID
> 
> Is it an dummy ioasid just for providing GPA mappings for nesting purpose
> of other IOASIDs? Then we waste one per VM?

Generic ioasids are "free"; they are just software constructs in the
kernel.

> > When PASIDs are allocated the uAPI will be told to logically nested
> > under the first ioasid. When VFIO authorizes a PASID for a RID it
> > checks that all the HW rules are being followed.
> 
> As I explained above, why cannot we just use iommu domain to connect 
> the dots? 

Security.

> Every passthrough framework needs to create an iommu domain
> first. and It needs to support both devices w/ PASID and devices w/o
> PASID.  For devices w/o PASID it needs to invent its own MAP
> interface anyway.  

No, it should consume an ioasid from /dev/ioasid, use a common ioasid
map interface and assign that ioasid to a RID.

Don't get so fixated on PASID as a special case.

> Then why do we bother creating another MAP interface through
> /dev/ioasid which not only duplicates but also creating transition
> burden between two set of MAP interfaces when the guest turns on/off
> the pasid capability on the device?

Don't transition. Always use the new interface. qemu detects the
kernel supports /dev/ioasid and *all iommu page table configuration*
goes through there. VFIO and VDPA APIs become unused for iommu
configuration.

> 'universally' upon from which angle you look at this problem. From IOASID
> p.o.v possibly yes, but from device passthrough p.o.v. it's the opposite
> since the passthrough framework needs to handle devices w/o PASID anyway
> (or even for device w/ PASID it could send traffic w/o PASID) thus 'universally'
> makes more sense if the passthrough framework can use one interface of its
> own to manage GPA mappings for all consumers (apply to the case when a
> PASID is allowed/authorized).

You correctly named it /dev/ioasid. It is a generic way to allocate,
manage and assign IOMMU page tables; when generalized, only some of
them may consume a limited PASID.

RID and RID,PASID are the same thing, just a small difference in how
they match TLPs.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07  8:17                                 ` Tian, Kevin
  2021-04-07 11:58                                   ` Jason Gunthorpe
@ 2021-04-07 18:43                                   ` Jean-Philippe Brucker
  2021-04-07 19:36                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-04-07 18:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Jason Wang, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse

On Wed, Apr 07, 2021 at 08:17:50AM +0000, Tian, Kevin wrote:
> btw this discussion was raised when discussing the I/O page fault handling
> process. Currently the IOMMU layer implements a per-device fault reporting
> mechanism, which requires VFIO to register a handler to receive all faults 
> on its device and then forwards to ioasid if it's due to 1st-level. Possibly it 
> makes more sense to convert it into a per-pgtable reporting scheme, and 
> then the owner of each pgtable should register its own handler.

Maybe, but you do need device information in there, since that's how the
fault is reported to the guest and how the response is routed back to the
faulting device (only PASID+PRGI would cause aliasing). And we need to
report non-recoverable faults, as well as recoverable ones without PASID,
once we hand control of level-1 page tables to guests.

> It means
> for 1) VFIO will register a 2nd-level pgtable handler while /dev/ioasid
> will register a 1st-level pgtable handler, while for 3) /dev/ioasid will register 
> handlers for both 1st-level and 2nd-level pgtable. Jean? also want to know 
> your thoughts...  

Moving all IOMMU controls to /dev/ioasid rather than splitting them is
probably better. Hopefully the implementation can reuse most of
vfio_iommu_type1.

I'm trying to sketch what may work for Arm, if we have to reuse
/dev/ioasid to avoid duplication of fault and inval queues:

* Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
  No operation available since we don't know what the device and IOMMU
  capabilities are.

* Attach the handle to a VF. With VFIO that would be
  VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
  with the handle, and decide which operations are available.

* With a map/unmap vIOMMU (or shadow mappings), a single translation level
  is supported. With a nesting vIOMMU, we're populating the level-2
  translation (some day maybe by binding the KVM page tables, but
  currently with map/unmap ioctl).

  Single-level translation needs single VF per container. Two level would
  allow sharing stage-2 between multiple VFs, though it's a pain to define
  and implement.

* Without a vIOMMU or if the vIOMMU starts in bypass, populate the
  container page tables.

Start the guest.

* With a map/unmap vIOMMU, guest creates mappings, userspace populates the
  page tables with map/unmap ioctl.

  It would be possible to add a PASID mode there: guest requests an
  address space with a specific PASID, userspace derives an IOASID handle
  from the container handle and populates that address space with map/unmap
  ioctl. That would enable PASID on sub-VF assignment, which requires the
  host to control which PASID is programmed into the VF (with
  DEVICE_ALLOW_IOASID, I guess). And either the host allocates the PASID
  in this case (which isn't supported by a vSMMU) or we have to do a
  vPASID -> pPASID translation. I don't know if it's worth the effort.

Or
* With a nesting vIOMMU, the guest attaches a PASID table to a VF,
  userspace issues a SET_PASID_TABLE ioctl on the container handle. If
  we support multiple VFs per container, we first need to derive a child
  container from the main one and the device, then attach the PASID table.

  Guest programs the PASID table, sends invalidations when removing
  mappings which are relayed to the host on the child container. Page
  faults and response queue would be per container, so if multiple VF per
  container, we could have one queue for the parent (level-2 faults) and
  one for each child (level-1 faults).
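
To make the nesting case concrete, a minimal sketch of the parent/child
container idea with per-container fault queues. Container, derive_child,
set_pasid_table and report_fault are hypothetical names for this sketch
only, and the routing rule is my assumption of how level-1 and level-2
faults would be split:

```python
from collections import deque


class Container:
    """Toy container handle with its own fault queue."""

    def __init__(self, parent=None, device=None):
        self.parent = parent
        self.device = device      # VF this child was derived for, if any
        self.children = {}        # device -> child container
        self.pasid_table = None
        self.faults = deque()

    def derive_child(self, device):
        # One child per VF when several VFs share the parent (level-2).
        child = Container(parent=self, device=device)
        self.children[device] = child
        return child

    def set_pasid_table(self, table):
        # Stand-in for a SET_PASID_TABLE-style ioctl on this handle.
        self.pasid_table = table

    def report_fault(self, device, level, addr):
        # Level-2 faults go to the parent queue, level-1 faults to the
        # queue of the child derived for the faulting device.
        if level == 2:
            target = self
        else:
            target = self.children[device]
        target.faults.append((device, level, addr))
```

The fault record keeps the device identifier, since the response has to be
routed back to the faulting device.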

Thanks,
Jean


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07 18:43                                   ` Jean-Philippe Brucker
@ 2021-04-07 19:36                                     ` Jason Gunthorpe
  2021-04-08  9:37                                       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-07 19:36 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, Jason Wang, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse

On Wed, Apr 07, 2021 at 08:43:50PM +0200, Jean-Philippe Brucker wrote:

> * Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
>   No operation available since we don't know what the device and IOMMU
>   capabilities are.
>
> * Attach the handle to a VF. With VFIO that would be
>   VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
>   with the handle, and decide which operations are available.

Right, this is basically the point: the VFIO container (/dev/vfio)
and the /dev/ioasid we are talking about have a core of
similarity. ioasid is the generalized, modernized, and cross-subsystem
version of the same idea. Instead of calling it "vfio container" we
call it something that evokes the idea of controlling the iommu.

The issue is to separate /dev/vfio generic functionality from vfio and
share it with every subsystem.

It may be that /dev/vfio and /dev/ioasid end up sharing a lot of code,
with a different IOCTL interface around it. The vfio_iommu_driver_ops
is not particularly VFIOy.

Creating /dev/ioasid may primarily start as a code reorganization
exercise.

> * With a map/unmap vIOMMU (or shadow mappings), a single translation level
>   is supported. With a nesting vIOMMU, we're populating the level-2
>   translation (some day maybe by binding the KVM page tables, but
>   currently with map/unmap ioctl).
> 
>   Single-level translation needs single VF per container. 

Really? Why?

Jason


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07 12:20                                   ` Jason Gunthorpe
@ 2021-04-07 23:50                                     ` Tian, Kevin
  2021-04-08 11:41                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-07 23:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse, Jason Wang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 7, 2021 8:21 PM
> 
> On Wed, Apr 07, 2021 at 02:08:33AM +0000, Tian, Kevin wrote:
> 
> > > Because if you don't then we enter insane world where a PASID is being
> > > created under /dev/ioasid but its translation path flows through setup
> > > done by VFIO and the whole user API becomes an incomprehensible mess.
> > >
> > > How will you even associate the PASID with the other translation??
> >
> > PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
> > has GPA->HPA mappings already configured. If we view that mapping as an
> > attribute of the iommu domain, it's reasonable to have the userspace-bound
> > pgtable through /dev/ioasid to nest on it.
> 
> A user controlled page table should absolutely not be an attribute of
> a hidden kernel object, nor should two parts of the kernel silently
> connect to each other via a hidden internal objects like this.
> 
> Security is important - the kind of connection must use some explicit
> FD authorization to access shared objects, not be made implicit!
> 
> IMHO this direction is a dead end for this reason.
> 

Could you elaborate on what exact security problem this approach brings?
Isn't ALLOW_PASID the authorization interface for the connection?

Based on all your replies, I now see that what you actually want is to
generalize all IOMMU-related stuff through /dev/ioasid (sort of /dev/iommu),
which requires factoring vfio_iommu_type1 out into a general part. This is
a huge amount of work.

Is it really the only practice in Linux that any new feature has to be
blocked as long as some refactoring work is identified? Don't people accept
any balance between enabling new features and completing refactoring
work through a staged approach, as long as we don't introduce a uAPI
specifically for the staging purpose? ☹

Thanks
Kevin


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07 19:36                                     ` Jason Gunthorpe
@ 2021-04-08  9:37                                       ` Jean-Philippe Brucker
  0 siblings, 0 replies; 269+ messages in thread
From: Jean-Philippe Brucker @ 2021-04-08  9:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse

On Wed, Apr 07, 2021 at 04:36:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 07, 2021 at 08:43:50PM +0200, Jean-Philippe Brucker wrote:
> 
> > * Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
> >   No operation available since we don't know what the device and IOMMU
> >   capabilities are.
> >
> > * Attach the handle to a VF. With VFIO that would be
> >   VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
> >   with the handle, and decide which operations are available.
> 
> Right, this is basically the point, - the VFIO container (/dev/vfio)
> and the /dev/ioasid we are talking about have a core of
> similarity. ioasid is the generalized, modernized, and cross-subsystem
> version of the same idea. Instead of calling it "vfio container" we
> call it something that evokes the idea of controlling the iommu.
> 
> The issue is to separate /dev/vfio generic functionality from vfio and
> share it with every subsystem.
> 
> It may be that /dev/vfio and /dev/ioasid end up sharing a lot of code,
> with a different IOCTL interface around it. The vfio_iommu_driver_ops
> is not particularly VFIOy.
> 
> Creating /dev/ioasid may primarily start as a code reorganization
> exercise.
> 
> > * With a map/unmap vIOMMU (or shadow mappings), a single translation level
> >   is supported. With a nesting vIOMMU, we're populating the level-2
> >   translation (some day maybe by binding the KVM page tables, but
> >   currently with map/unmap ioctl).
> > 
> >   Single-level translation needs single VF per container. 
> 
> Really? Why?

The vIOMMU is started in bypass, so the device can do DMA to the GPA space
until the guest configures the vIOMMU, at which point each VF is either
kept in bypass or gets new DMA mappings, which requires the host to tear
down the bypass mappings and set up the guest mappings on a per-VF basis
(I'm not considering nesting translation in the host kernel for this,
because it's not supported by all pIOMMUs and is expensive in terms of TLB
and pinned memory). So keeping a single VF per container is simpler, but
there are certainly other programming models possible.

Thanks,
Jean



* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-07 23:50                                     ` Tian, Kevin
@ 2021-04-08 11:41                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-08 11:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Alex Williamson, Raj, Ashok,
	Jonathan Corbet, Jean-Philippe Brucker, LKML, Jiang, Dave, iommu,
	Li Zefan, Johannes Weiner, Tejun Heo, cgroups, Wu, Hao,
	David Woodhouse, Jason Wang

On Wed, Apr 07, 2021 at 11:50:02PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 7, 2021 8:21 PM
> > 
> > On Wed, Apr 07, 2021 at 02:08:33AM +0000, Tian, Kevin wrote:
> > 
> > > > Because if you don't then we enter insane world where a PASID is being
> > > > created under /dev/ioasid but its translation path flows through setup
> > > > done by VFIO and the whole user API becomes an incomprehensible mess.
> > > >
> > > > How will you even associate the PASID with the other translation??
> > >
> > > PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
> > > has GPA->HPA mappings already configured. If we view that mapping as an
> > > attribute of the iommu domain, it's reasonable to have the userspace-bound
> > > pgtable through /dev/ioasid to nest on it.
> > 
> > A user controlled page table should absolutely not be an attribute of
> > a hidden kernel object, nor should two parts of the kernel silently
> > connect to each other via a hidden internal objects like this.
> > 
> > Security is important - the kind of connection must use some explicit
> > FD authorization to access shared objects, not be made implicit!
> > 
> > IMHO this direction is a dead end for this reason.
> > 
> 
> Could you elaborate what exact security problem is brought with this 
> approach? Isn't ALLOW_PASID the authorization interface for the
> connection?

If the kernel objects don't come out of FDs then no.

> Is it really the only practice in Linux that any new feature has to be
> blocked as long as a refactoring work is identified? 

The practice is to define uAPIs that make sense and have a good chance
to be supported over a long time period, as the software evolves, not
to hack together a giant uAPI mess just to get some feature out the
door.

This proposal as it was originally shown is exactly the kind of hacky
uapi nobody wants to see. Tunneling an IOMMU uapi through a
whole bunch of other FDs is completely nuts.

Intel should basically be investing most of its time building a robust
and well designed uAPI here, and not complaining that the community is
not doing Intel's job for free.

> Don't people accept any balance between enabling new features and
> completing refactoring work through a staging approach, as long as
> we don't introduce an uAPI specifically for the staging purpose? ☹

Since this is all uapi I don't see it as applicable here.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-01 16:03                                         ` Jason Gunthorpe
  2021-04-02  7:30                                           ` Tian, Kevin
@ 2021-04-15 13:11                                           ` Auger Eric
  2021-04-15 23:07                                             ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Auger Eric @ 2021-04-15 13:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

Hi Jason,

On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> 
>> DMA page faults are delivered to root-complex via page request message and
>> it is per-device according to PCIe spec. Page request handling flow is:
>>
>> 1) iommu driver receives a page request from device
>> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
>>    page and requested permissions etc.
>> 3) iommu driver triggers fault handler registered by device driver with
>>    iommu_report_device_fault()
> 
> This seems confused.
> 
> The PASID should define how to handle the page fault, not the driver.

In my series I don't use PASID at all. I am just enabling nested stage
and the guest uses a single context. I don't allocate any user PASID at
any point.

When there is a fault at the physical level (a stage 1 fault that concerns
the guest), it needs to be reported and injected into the guest. The vfio
pci driver registers a fault handler with the iommu layer, and in that
fault handler it fills a circ buffer and triggers an eventfd that is
listened to by the VFIO-PCI QEMU device. The latter retrieves the fault
from the mmapped circ buffer, knows which vIOMMU it is attached to, and
passes the fault to the vIOMMU. Then the vIOMMU triggers an IRQ in the
guest.

We are reusing the existing concepts from VFIO, region, IRQ to do that.

For that use case, would you also use /dev/ioasid?

Thanks

Eric
> 
> I don't remember any device specific actions in ATS, so what is the
> driver supposed to do?
> 
>> 4) device driver's fault handler signals an event FD to notify userspace to
>>    fetch the information about the page fault. If it's VM case, inject the
>>    page fault to VM and let guest to solve it.
> 
> If the PASID is set to 'report page fault to userspace' then some
> event should come out of /dev/ioasid, or be reported to a linked
> eventfd, or whatever.
> 
> If the PASID is set to 'SVM' then the fault should be passed to
> handle_mm_fault
> 
> And so on.
> 
> Userspace chooses what happens based on how they configure the PASID
> through /dev/ioasid.
> 
> Why would a device driver get involved here?
> 
>> Eric has sent below series for the page fault reporting for VM with passthru
>> device.
>> https://lore.kernel.org/kvm/20210223210625.604517-5-eric.auger@redhat.com/
> 
> It certainly should not be in vfio pci. Everything using a PASID needs
> this infrastructure, VDPA, mdev, PCI, CXL, etc.
> 
> Jason
> 



* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-15 13:11                                           ` Auger Eric
@ 2021-04-15 23:07                                             ` Jason Gunthorpe
  2021-04-16 13:12                                               ` Jacob Pan
  2021-04-16 13:38                                               ` Auger Eric
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-15 23:07 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
> Hi Jason,
> 
> On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > 
> >> DMA page faults are delivered to root-complex via page request message and
> >> it is per-device according to PCIe spec. Page request handling flow is:
> >>
> >> 1) iommu driver receives a page request from device
> >> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
> >>    page and requested permissions etc.
> >> 3) iommu driver triggers fault handler registered by device driver with
> >>    iommu_report_device_fault()
> > 
> > This seems confused.
> > 
> > The PASID should define how to handle the page fault, not the driver.
> 
> In my series I don't use PASID at all. I am just enabling nested stage
> and the guest uses a single context. I don't allocate any user PASID at
> any point.
> 
> When there is a fault at physical level (a stage 1 fault that concerns
> the guest), this latter needs to be reported and injected into the
> guest. The vfio pci driver registers a fault handler to the iommu layer
> and in that fault handler it fills a circ buffer and triggers an eventfd
> that is listened to by the VFIO-PCI QEMU device. this latter retrieves
> the fault from the mmapped circ buffer, it knows which vIOMMU it is
> attached to, and passes the fault to the vIOMMU.
> Then the vIOMMU triggers an IRQ in the guest.
> 
> We are reusing the existing concepts from VFIO, region, IRQ to do that.
> 
> For that use case, would you also use /dev/ioasid?

/dev/ioasid could do all the things you described vfio-pci as doing,
it can even do them the same way you just described.

Stated another way, do you plan to duplicate all of this code someday
for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
platform devices, right?

I feel what you guys are struggling with is some choice in the iommu
kernel APIs that causes the events to be delivered to the pci_device
owner, not the PASID owner.

That feels solvable.
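
As a minimal sketch of what delivering to the PASID owner could look like
(a hypothetical model only; FaultRouter, register and report are invented
names, and the RID-only fallback is an assumption for faults without a
PASID):

```python
class FaultRouter:
    """Toy fault dispatch keyed by (RID, PASID) instead of by device."""

    def __init__(self):
        self._handlers = {}

    def register(self, rid, pasid, handler):
        # pasid=None registers the owner of the RID-only translation.
        self._handlers[(rid, pasid)] = handler

    def report(self, rid, pasid, addr):
        # Prefer the owner of this exact RID,PASID; fall back to the
        # RID-only owner when no PASID-specific handler exists.
        handler = self._handlers.get((rid, pasid))
        if handler is None:
            handler = self._handlers.get((rid, None))
        if handler is None:
            return False          # nobody owns this translation
        handler(rid, pasid, addr)
        return True
```

The handler still receives the RID, so the device that faulted remains
known to whoever owns the page table.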

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-15 23:07                                             ` Jason Gunthorpe
@ 2021-04-16 13:12                                               ` Jacob Pan
  2021-04-16 15:45                                                 ` Alex Williamson
  2021-04-16 13:38                                               ` Auger Eric
  1 sibling, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-04-16 13:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Auger Eric, Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave, jacob.jun.pan

Hi Jason,

On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
> > Hi Jason,
> > 
> > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:  
> > > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > >   
> > >> DMA page faults are delivered to root-complex via page request
> > >> message and it is per-device according to PCIe spec. Page request
> > >> handling flow is:
> > >>
> > >> 1) iommu driver receives a page request from device
> > >> 2) iommu driver parses the page request message. Get the RID,PASID,
> > >> faulted page and requested permissions etc.
> > >> 3) iommu driver triggers fault handler registered by device driver
> > >> with iommu_report_device_fault()  
> > > 
> > > This seems confused.
> > > 
> > > The PASID should define how to handle the page fault, not the driver.
> > >  
> > 
> > In my series I don't use PASID at all. I am just enabling nested stage
> > and the guest uses a single context. I don't allocate any user PASID at
> > any point.
> > 
> > When there is a fault at physical level (a stage 1 fault that concerns
> > the guest), this latter needs to be reported and injected into the
> > guest. The vfio pci driver registers a fault handler to the iommu layer
> > and in that fault handler it fills a circ buffer and triggers an eventfd
> > that is listened to by the VFIO-PCI QEMU device. This latter retrieves
> > the fault from the mmapped circ buffer, it knows which vIOMMU it is
> > attached to, and passes the fault to the vIOMMU.
> > Then the vIOMMU triggers an IRQ in the guest.
> > 
> > We are reusing the existing concepts from VFIO, region, IRQ to do that.
> > 
> > For that use case, would you also use /dev/ioasid?  
> 
> /dev/ioasid could do all the things you described vfio-pci as doing,
> it can even do them the same way you just described.
> 
> Stated another way, do you plan to duplicate all of this code someday
> for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> platform devices, right?
> 
> I feel what you guys are struggling with is some choice in the iommu
> kernel APIs that cause the events to be delivered to the pci_device
> owner, not the PASID owner.
> 
> That feels solvable.
> 
Perhaps more of a philosophical question for you and Alex. There is no
doubt that the direction you guided for /dev/ioasid is a much cleaner one,
especially after VDPA emerged as another IOMMU backed framework.

The question is what do we do with the nested translation features that have
been targeting the existing VFIO-IOMMU for the last three years? That
predates VDPA. Shall we put a stop marker *after* nested support and say no
more extensions for VFIO-IOMMU, new features must be built on this new
interface?

If we were to close a checkout line for some unforeseen reasons, should we
honor the customers already in line for a long time?

This is not a tactic or excuse for not working on the new /dev/ioasid
interface. In fact, I believe we can benefit from the lessons learned while
completing the existing work. This will give confidence to the new
interface. Thoughts?

> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-15 23:07                                             ` Jason Gunthorpe
  2021-04-16 13:12                                               ` Jacob Pan
@ 2021-04-16 13:38                                               ` Auger Eric
  2021-04-16 14:05                                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Auger Eric @ 2021-04-16 13:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

Hi Jason,

On 4/16/21 1:07 AM, Jason Gunthorpe wrote:
> On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
>> Hi Jason,
>>
>> On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
>>>
>>>> DMA page faults are delivered to root-complex via page request message and
>>>> it is per-device according to PCIe spec. Page request handling flow is:
>>>>
>>>> 1) iommu driver receives a page request from device
>>>> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
>>>>    page and requested permissions etc.
>>>> 3) iommu driver triggers fault handler registered by device driver with
>>>>    iommu_report_device_fault()
>>>
>>> This seems confused.
>>>
>>> The PASID should define how to handle the page fault, not the driver.
>>
>> In my series I don't use PASID at all. I am just enabling nested stage
>> and the guest uses a single context. I don't allocate any user PASID at
>> any point.
>>
>> When there is a fault at physical level (a stage 1 fault that concerns
>> the guest), this latter needs to be reported and injected into the
>> guest. The vfio pci driver registers a fault handler to the iommu layer
>> and in that fault handler it fills a circ buffer and triggers an eventfd
>> that is listened to by the VFIO-PCI QEMU device. This latter retrieves
>> the fault from the mmapped circ buffer, it knows which vIOMMU it is
>> attached to, and passes the fault to the vIOMMU.
>> Then the vIOMMU triggers an IRQ in the guest.
>>
>> We are reusing the existing concepts from VFIO, region, IRQ to do that.
>>
>> For that use case, would you also use /dev/ioasid?
> 
> /dev/ioasid could do all the things you described vfio-pci as doing,
> it can even do them the same way you just described.
> 
> Stated another way, do you plan to duplicate all of this code someday
> for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> platform devices, right?
vfio regions and IRQ related APIs are common user interfaces exposed by
all vfio drivers, including platform. Then the actual circular buffer
implementation details can be put in a common lib.
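
A minimal sketch of what such a common circular buffer helper might look
like, assuming a single producer (the fault handler) and a single consumer
(userspace reading the mmapped region). All names here (fault_rec,
fault_ring, ring_push, ring_pop) are hypothetical and are not taken from
the VFIO series; a real shared ring would also need memory barriers and an
eventfd signal after each push, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u	/* must be a power of two */

/* Hypothetical fault record; the fields are illustrative only. */
struct fault_rec {
	uint64_t addr;		/* faulting I/O virtual address */
	uint32_t perm;		/* requested permissions */
	uint32_t reason;	/* fault reason code */
};

struct fault_ring {
	uint32_t head;		/* next slot the producer writes */
	uint32_t tail;		/* next slot the consumer reads */
	struct fault_rec rec[RING_SIZE];
};

/* Producer side (fault handler). Returns 0 on success, -1 if full. */
static int ring_push(struct fault_ring *r, const struct fault_rec *f)
{
	assert(r && f);
	if (r->head - r->tail == RING_SIZE)
		return -1;
	r->rec[r->head & (RING_SIZE - 1)] = *f;
	r->head++;
	return 0;
}

/* Consumer side (userspace). Returns 0 on success, -1 if empty. */
static int ring_pop(struct fault_ring *r, struct fault_rec *out)
{
	assert(r && out);
	if (r->head == r->tail)
		return -1;
	*out = r->rec[r->tail & (RING_SIZE - 1)];
	r->tail++;
	return 0;
}
```

Nothing in the helper refers to PCI or to platform devices, which is the
property that would let several vfio drivers share it.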

As for the thin vfio iommu wrappers, the ones you don't like, they are
implemented in type1 code.

Maybe the need for /dev/ioasid is more pressing for PASID management, but
for the nested use case that's not obvious to me, and in your different
replies it was not crystal clear where the use case belongs.

The redesign requirement came pretty late in the development process.
The iommu user API has been upstream for a while, and the VFIO interfaces
were submitted a long time ago and have been under review for quite some
time. Redesigning everything with a different API, undefined at this point,
is a major setback for our work and will have a large impact on the
introduction of features companies are looking forward to, hence our
frustration.

Thanks

Eric


> 
> I feel what you guys are struggling with is some choice in the iommu
> kernel APIs that cause the events to be delivered to the pci_device
> owner, not the PASID owner.
> 
> That feels solvable.
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 13:38                                               ` Auger Eric
@ 2021-04-16 14:05                                                 ` Jason Gunthorpe
  2021-04-16 14:26                                                   ` Auger Eric
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-16 14:05 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

On Fri, Apr 16, 2021 at 03:38:02PM +0200, Auger Eric wrote:

> The redesign requirement came pretty late in the development process.
> The iommu user API has been upstream for a while, and the VFIO interfaces
> were submitted a long time ago and have been under review for quite some
> time. Redesigning everything with a different API, undefined at this point,
> is a major setback for our work and will have a large impact on the
> introduction of features companies are looking forward to, hence our
> frustration.

I will answer both you and Jacob at once.

This is uAPI, once it is set it can never be changed.

The kernel process and philosophy is to invest heavily in uAPI
development and review to converge on the best uAPI possible.

Many past submissions have taken a long time to get this right, there
are several high profile uAPI examples.

Do you think this case is so special, or the concerns so minor, that it
should get to bypass all of the normal process?

Ask yourself, is anyone advocating for the current direction on
technical merits alone?

Certainly the patches I last saw were completely disgusting from a
uAPI design perspective.

It was against the development process to organize this work the way
it was done. Merging a whack of dead code to the kernel to support a
uAPI vision that was never clearly articulated was a big mistake.

Start from the beginning. Invest heavily in defining a high quality
uAPI. Clearly describe the uAPI to all stakeholders. Break up the
implementation into patch series without dead code. Make the
patches. Remove the dead code this group has already added.

None of this should be a surprise. The VDPA discussion and related
"what is a mdev" over a year ago made it pretty clear VFIO is not the
exclusive user of "IOMMU in userspace" and that places limits on what
kind of uAPI expansion it should experience going forward.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 14:05                                                 ` Jason Gunthorpe
@ 2021-04-16 14:26                                                   ` Auger Eric
  2021-04-16 14:34                                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Auger Eric @ 2021-04-16 14:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

Hi,
On 4/16/21 4:05 PM, Jason Gunthorpe wrote:
> On Fri, Apr 16, 2021 at 03:38:02PM +0200, Auger Eric wrote:
> 
>> The redesign requirement came pretty late in the development process.
>> The iommu user API has been upstream for a while, and the VFIO interfaces
>> were submitted a long time ago and have been under review for quite some
>> time. Redesigning everything with a different API, undefined at this point,
>> is a major setback for our work and will have a large impact on the
>> introduction of features companies are looking forward to, hence our
>> frustration.
> 
> I will answer both you and Jacob at once.
> 
> This is uAPI, once it is set it can never be changed.
> 
> The kernel process and philosophy is to invest heavily in uAPI
> development and review to converge on the best uAPI possible.
> 
> Many past submissions have taken a long time to get this right, there
> are several high profile uAPI examples.
> 
> Do you think this case is so special, or the concerns so minor, that it
> should get to bypass all of the normal process?

It is not my intent to bypass any process. I am just trying to
understand what needs to be re-designed and for what use case.
> 
> Ask yourself, is anyone advocating for the current direction on
> technical merits alone?
> 
> Certainly the patches I last saw were completely disgusting from a
> uAPI design perspective.
> 
> It was against the development process to organize this work the way
> it was done. Merging a whack of dead code to the kernel to support a
> uAPI vision that was never clearly articulated was a big mistake.
> 
> Start from the beginning. Invest heavily in defining a high quality
> uAPI. Clearly describe the uAPI to all stakeholders.
This was largely done during several confs including Plumbers, KVM Forum,
for several years. Also API docs were shared on the ML. I don't remember
any objection being raised at those moments.

> Break up the
> implementation into patch series without dead code. Make the
> patches. Remove the dead code this group has already added.
> 
> None of this should be a surprise. The VDPA discussion and related
> "what is a mdev" over a year ago made it pretty clear VFIO is not the
> exclusive user of "IOMMU in userspace" and that places limits on what
> kind of uAPI expansion it should experience going forward.
Maybe clear for you but most probably not for many other stakeholders.

Anyway I do not intend to further argue and I will be happy to learn
from you and work with you, Jacob, Liu and all other stakeholders to
define a better integration.

Thanks

Eric
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 14:26                                                   ` Auger Eric
@ 2021-04-16 14:34                                                     ` Jason Gunthorpe
  2021-04-16 15:00                                                       ` Auger Eric
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-16 14:34 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

On Fri, Apr 16, 2021 at 04:26:19PM +0200, Auger Eric wrote:

> This was largely done during several confs including Plumbers, KVM Forum,
> for several years. Also API docs were shared on the ML. I don't remember
> any objection being raised at those moments.

I don't think anyone objects to the high level ideas, but
implementation does matter. I don't think anyone presented "hey we
will tunnel a uAPI through VFIO to the IOMMU subsystem" - did they?

Look at the fairly simple IMS situation, for example. This was
presented at plumbers too, and the slides were great - but the
implementation was too hacky. It required a major rework of the x86
interrupt handling before it was OK.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 14:34                                                     ` Jason Gunthorpe
@ 2021-04-16 15:00                                                       ` Auger Eric
  0 siblings, 0 replies; 269+ messages in thread
From: Auger Eric @ 2021-04-16 15:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jean-Philippe Brucker, Tian, Kevin, Jacob Pan, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Alex Williamson, Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang,
	Dave

Hi Jason,

On 4/16/21 4:34 PM, Jason Gunthorpe wrote:
> On Fri, Apr 16, 2021 at 04:26:19PM +0200, Auger Eric wrote:
> 
>> This was largely done during several confs including Plumbers, KVM Forum,
>> for several years. Also API docs were shared on the ML. I don't remember
>> any objection being raised at those moments.
> 
> I don't think anyone objects to the high level ideas, but
> implementation does matter. I don't think anyone presented "hey we
> will tunnel a uAPI through VFIO to the IOMMU subsystem" - did they?

At minimum
https://events19.linuxfoundation.cn/wp-content/uploads/2017/11/Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf

But most obviously everything is documented in
Documentation/userspace-api/iommu.rst where the VFIO tunneling is
clearly stated ;-)

But well let's work together to design a better and more elegant
solution then.

Thanks

Eric
> 
> Look at the fairly simple IMS situation, for example. This was
> presented at plumbers too, and the slides were great - but the
> implementation was too hacky. It required a major rework of the x86
> interrupt handling before it was OK.
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 13:12                                               ` Jacob Pan
@ 2021-04-16 15:45                                                 ` Alex Williamson
  2021-04-16 17:23                                                   ` Jacob Pan
  2021-04-21 13:18                                                   ` Liu, Yi L
  0 siblings, 2 replies; 269+ messages in thread
From: Alex Williamson @ 2021-04-16 15:45 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, Auger Eric, Liu, Yi L, Jean-Philippe Brucker,
	Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Fri, 16 Apr 2021 06:12:58 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Hi Jason,
> 
> On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:  
> > > Hi Jason,
> > > 
> > > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:    
> > > > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > > >     
> > > >> DMA page faults are delivered to root-complex via page request
> > > >> message and it is per-device according to PCIe spec. Page request
> > > >> handling flow is:
> > > >>
> > > >> 1) iommu driver receives a page request from device
> > > >> 2) iommu driver parses the page request message. Get the RID,PASID,
> > > >> faulted page and requested permissions etc.
> > > >> 3) iommu driver triggers fault handler registered by device driver
> > > >> with iommu_report_device_fault()    
> > > > 
> > > > This seems confused.
> > > > 
> > > > The PASID should define how to handle the page fault, not the driver.
> > > >    
> > > 
> > > In my series I don't use PASID at all. I am just enabling nested stage
> > > and the guest uses a single context. I don't allocate any user PASID at
> > > any point.
> > > 
> > > When there is a fault at physical level (a stage 1 fault that concerns
> > > the guest), this latter needs to be reported and injected into the
> > > guest. The vfio pci driver registers a fault handler to the iommu layer
> > > > and in that fault handler it fills a circ buffer and triggers an eventfd
> > > > that is listened to by the VFIO-PCI QEMU device. This latter retrieves
> > > > the fault from the mmapped circ buffer, it knows which vIOMMU it is
> > > > attached to, and passes the fault to the vIOMMU.
> > > > Then the vIOMMU triggers an IRQ in the guest.
> > > 
> > > We are reusing the existing concepts from VFIO, region, IRQ to do that.
> > > 
> > > For that use case, would you also use /dev/ioasid?    
> > 
> > /dev/ioasid could do all the things you described vfio-pci as doing,
> > it can even do them the same way you just described.
> > 
> > Stated another way, do you plan to duplicate all of this code someday
> > for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> > platform devices, right?
> > 
> > I feel what you guys are struggling with is some choice in the iommu
> > kernel APIs that cause the events to be delivered to the pci_device
> > owner, not the PASID owner.
> > 
> > That feels solvable.
> >   
> Perhaps more of a philosophical question for you and Alex. There is no
> doubt that the direction you guided for /dev/ioasid is a much cleaner one,
> especially after VDPA emerged as another IOMMU backed framework.

I think this statement answers all your remaining questions ;)

> The question is what do we do with the nested translation features that have
> been targeting the existing VFIO-IOMMU for the last three years? That
> predates VDPA. Shall we put a stop marker *after* nested support and say no
> more extensions for VFIO-IOMMU, new features must be built on this new
> interface?
>
> If we were to close a checkout line for some unforeseen reasons, should we
> honor the customers already in line for a long time?
> 
> This is not a tactic or excuse for not working on the new /dev/ioasid
> interface. In fact, I believe we can benefit from the lessons learned while
> completing the existing. This will give confidence to the new
> interface. Thoughts?

I understand a big part of Jason's argument is that we shouldn't be in
the habit of creating duplicate interfaces; we should create one
well-designed interface to share among multiple subsystems.  As new users
have emerged, our solution needs to change to a common one rather than
a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
the wrong level, requiring userspace interfaces for each subsystem.

Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
that changes if we proceed to enable the interfaces to tunnel it
through VFIO.

The logical answer would therefore be that we don't make that
commitment to the IOMMU uAPI if we believe now that it's fundamentally
flawed.

Ideally this new /dev/ioasid interface, and making use of it as a VFIO
IOMMU backend, should replace type1.  Type1 will live on until that
interface gets to parity, at which point we may deprecate type1, but it
wouldn't make sense to continue to expand type1 in the same direction
as we intend /dev/ioasid to take over in the meantime, especially if it
means maintaining an otherwise dead uAPI.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 15:45                                                 ` Alex Williamson
@ 2021-04-16 17:23                                                   ` Jacob Pan
  2021-04-16 17:54                                                     ` Jason Gunthorpe
  2021-04-21 13:18                                                   ` Liu, Yi L
  1 sibling, 1 reply; 269+ messages in thread
From: Jacob Pan @ 2021-04-16 17:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Auger Eric, Liu, Yi L, Jean-Philippe Brucker,
	Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, jacob.jun.pan

Hi Alex,

On Fri, 16 Apr 2021 09:45:47 -0600, Alex Williamson
<alex.williamson@redhat.com> wrote:

> On Fri, 16 Apr 2021 06:12:58 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Hi Jason,
> > 
> > On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:    
> > > > Hi Jason,
> > > > 
> > > > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:      
> > > > > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > > > >       
> > > > >> DMA page faults are delivered to root-complex via page request
> > > > >> message and it is per-device according to PCIe spec. Page request
> > > > >> handling flow is:
> > > > >>
> > > > >> 1) iommu driver receives a page request from device
> > > > >> 2) iommu driver parses the page request message. Get the
> > > > >> RID,PASID, faulted page and requested permissions etc.
> > > > >> 3) iommu driver triggers fault handler registered by device
> > > > >> driver with iommu_report_device_fault()      
> > > > > 
> > > > > This seems confused.
> > > > > 
> > > > > The PASID should define how to handle the page fault, not the
> > > > > driver. 
> > > > 
> > > > In my series I don't use PASID at all. I am just enabling nested
> > > > stage and the guest uses a single context. I don't allocate any
> > > > user PASID at any point.
> > > > 
> > > > When there is a fault at physical level (a stage 1 fault that
> > > > concerns the guest), this latter needs to be reported and injected
> > > > into the guest. The vfio pci driver registers a fault handler to
> > > > > the iommu layer and in that fault handler it fills a circ buffer
> > > > > and triggers an eventfd that is listened to by the VFIO-PCI QEMU
> > > > > device. This latter retrieves the fault from the mmapped circ
> > > > > buffer, it knows which vIOMMU it is attached to, and passes the
> > > > > fault to the vIOMMU. Then the vIOMMU triggers an IRQ in the guest.
> > > > 
> > > > We are reusing the existing concepts from VFIO, region, IRQ to do
> > > > that.
> > > > 
> > > > For that use case, would you also use /dev/ioasid?      
> > > 
> > > /dev/ioasid could do all the things you described vfio-pci as doing,
> > > it can even do them the same way you just described.
> > > 
> > > Stated another way, do you plan to duplicate all of this code someday
> > > for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> > > platform devices, right?
> > > 
> > > I feel what you guys are struggling with is some choice in the iommu
> > > kernel APIs that cause the events to be delivered to the pci_device
> > > owner, not the PASID owner.
> > > 
> > > That feels solvable.
> > >     
> > Perhaps more of a philosophical question for you and Alex. There is no
> > doubt that the direction you guided for /dev/ioasid is a much cleaner
> > one, especially after VDPA emerged as another IOMMU backed framework.  
> 
> I think this statement answers all your remaining questions ;)
> 
> > The question is what do we do with the nested translation features that
> > have been targeting the existing VFIO-IOMMU for the last three years?
> > That predates VDPA. Shall we put a stop marker *after* nested support
> > and say no more extensions for VFIO-IOMMU, new features must be built
> > on this new interface?
> >
> > If we were to close a checkout line for some unforeseen reasons, should
> > we honor the customers already in line for a long time?
> > 
> > This is not a tactic or excuse for not working on the new /dev/ioasid
> > interface. In fact, I believe we can benefit from the lessons learned
> > while completing the existing. This will give confidence to the new
> > interface. Thoughts?  
> 
> I understand a big part of Jason's argument is that we shouldn't be in
> the habit of creating duplicate interfaces; we should create one
> well-designed interface to share among multiple subsystems.  As new users
> have emerged, our solution needs to change to a common one rather than
> a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
> the wrong level, requiring userspace interfaces for each subsystem.
> 
> Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
> that changes if we proceed to enable the interfaces to tunnel it
> through VFIO.
> 
> The logical answer would therefore be that we don't make that
> commitment to the IOMMU uAPI if we believe now that it's fundamentally
> flawed.
> 
I agree the uAPI data tunneling is definitely flawed in terms of
scalability.

I was just thinking it is still a small part of the overall
picture, considering there are other parts such as fault reporting, user
space deployment, performance, and security. By completing the support on
the existing VFIO framework, it would at least offer a clear landscape that
the new /dev/ioasid can improve upon.

Perhaps this is similar to cgroup v1 vs v2, which took a long time and came
with known limitations in v1.

Anyway, I am glad we have a clear direction now.

Thanks,

Jacob

> Ideally this new /dev/ioasid interface, and making use of it as a VFIO
> IOMMU backend, should replace type1.  Type1 will live on until that
> interface gets to parity, at which point we may deprecate type1, but it
> wouldn't make sense to continue to expand type1 in the same direction
> as we intend /dev/ioasid to take over in the meantime, especially if it
> means maintaining an otherwise dead uAPI.  Thanks,
> 

> Alex
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 17:23                                                   ` Jacob Pan
@ 2021-04-16 17:54                                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-16 17:54 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Alex Williamson, Auger Eric, Liu, Yi L, Jean-Philippe Brucker,
	Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Fri, Apr 16, 2021 at 10:23:32AM -0700, Jacob Pan wrote:

> Perhaps similar to cgroup v1 vs v2, it took a long time and with known
> limitations in v1.

cgroup v2 is still having transition problems, if anything it is a
cautionary tale to think really hard about uAPI because transitioning
can be really hard.

It might be very wise to make /dev/ioasid and /dev/vfio ioctl
compatible in some way so existing software has a smoother upgrade
path.

For instance, by defining a default IOASID.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-16 15:45                                                 ` Alex Williamson
  2021-04-16 17:23                                                   ` Jacob Pan
@ 2021-04-21 13:18                                                   ` Liu, Yi L
  2021-04-21 16:23                                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: Liu, Yi L @ 2021-04-21 13:18 UTC (permalink / raw)
  To: Alex Williamson, Jacob Pan
  Cc: Jason Gunthorpe, Auger Eric, Jean-Philippe Brucker, Tian, Kevin,
	LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 16, 2021 11:46 PM
[...]
> > This is not a tactic or excuse for not working on the new /dev/ioasid
> > interface. In fact, I believe we can benefit from the lessons learned
> > while completing the existing. This will give confidence to the new
> > interface. Thoughts?
> 
> I understand a big part of Jason's argument is that we shouldn't be in
> the habit of creating duplicate interfaces; we should create one
> well-designed interface to share among multiple subsystems.  As new users
> have emerged, our solution needs to change to a common one rather than
> a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
> the wrong level, requiring userspace interfaces for each subsystem.
> 
> Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
> that changes if we proceed to enable the interfaces to tunnel it
> through VFIO.
> 
> The logical answer would therefore be that we don't make that
> commitment to the IOMMU uAPI if we believe now that it's fundamentally
> flawed.
> 
> Ideally this new /dev/ioasid interface, and making use of it as a VFIO
> IOMMU backend, should replace type1. 

Yeah, just to double check: I think this also requires a new set of uAPIs
(e.g. new MAP/UNMAP), which means the current VFIO IOMMU type1 related ioctls
would be deprecated in the future, right?

> Type1 will live on until that
> interface gets to parity, at which point we may deprecate type1, but it
> wouldn't make sense to continue to expand type1 in the same direction
> as we intend /dev/ioasid to take over in the meantime, especially if it
> means maintaining an otherwise dead uAPI.  Thanks,

understood.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 13:18                                                   ` Liu, Yi L
@ 2021-04-21 16:23                                                     ` Jason Gunthorpe
  2021-04-21 16:54                                                       ` Alex Williamson
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-21 16:23 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Alex Williamson, Jacob Pan, Auger Eric, Jean-Philippe Brucker,
	Tian, Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
	iommu, cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Wed, Apr 21, 2021 at 01:18:07PM +0000, Liu, Yi L wrote:
> > Ideally this new /dev/ioasid interface, and making use of it as a VFIO
> > IOMMU backend, should replace type1. 
> 
> yeah, just a double check, I think this also requires a new set of uAPIs
> (e.g. new MAP/UNMAP), which means the current VFIO IOMMU type1 related ioctls
> would be deprecated in future. right?

This is something to think about; it might make sense to run the
current ioctls in some "compat" mode under /dev/ioasid just to make
migration easier.

In this sense /dev/ioasid would be a container that holds multiple
IOASIDs and every new format ioctl specifies the IOASID to operate
on. The legacy ioctls would use some default IOASID but otherwise act
the same.
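
A minimal sketch of that dispatch rule, assuming a toy in-memory model
(every name below is hypothetical and not taken from any posted patch):
new-format requests name an IOASID, legacy requests fall back to a
default slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model; none of these names exist in the kernel. */
#define MAX_IOASIDS    8
#define DEFAULT_IOASID 0    /* slot used implicitly by legacy requests */

struct ioasid_ctx {
    int      in_use;
    uint64_t nr_mappings;   /* stand-in for a real page table */
};

struct ioasid_container {
    struct ioasid_ctx ids[MAX_IOASIDS];
};

static void ioasid_container_init(struct ioasid_container *c)
{
    memset(c, 0, sizeof(*c));
}

/* New-format request: the caller names the IOASID explicitly. */
static int ioasid_map(struct ioasid_container *c, unsigned int id)
{
    assert(c != NULL);
    if (id >= MAX_IOASIDS || !c->ids[id].in_use)
        return -1;
    c->ids[id].nr_mappings++;
    return 0;
}

/* Legacy request: no IOASID argument, so use the default slot,
 * creating it on first use. */
static int legacy_map(struct ioasid_container *c)
{
    c->ids[DEFAULT_IOASID].in_use = 1;
    return ioasid_map(c, DEFAULT_IOASID);
}
```

Both entry points end in the same ioasid_map() path, which is the point
of the "compat" idea: the legacy call differs only in where the IOASID
number comes from.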

I'm assuming here there is nothing especially wrong with the /dev/vfio
interface beyond being in the wrong place in the kernel and not
supporting multiple IOASIDs?

Then there may be a fairly simple approach to just make /dev/vfio ==
/dev/ioasid, at least for type 1.

By this I mean we could have the new /dev/ioasid code take over the
/dev/vfio char dev and present both interfaces, but with the same
fops.

The VFIO code would have to remain somehow to support PPC until
someone from ppc world migrates the SPAPR_TCE to use the kernel's new
common IOMMU framework instead of the arch specialty thing it does
now. But it can at least be compile disabled on everything except ppc.

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 16:23                                                     ` Jason Gunthorpe
@ 2021-04-21 16:54                                                       ` Alex Williamson
  2021-04-21 17:52                                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Alex Williamson @ 2021-04-21 16:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Wed, 21 Apr 2021 13:23:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 21, 2021 at 01:18:07PM +0000, Liu, Yi L wrote:
> > > Ideally this new /dev/ioasid interface, and making use of it as a VFIO
> > > IOMMU backend, should replace type1.   
> > 
> > yeah, just a double check, I think this also requires a new set of uAPIs
> > (e.g. new MAP/UNMAP), which means the current VFIO IOMMU type1 related ioctls
> > would be deprecated in future. right?  
> 
> This is something to think about, it might make sense to run the
> current ioctls in some "compat" mode under /dev/ioasid just to make
> migration easier

Right, deprecating type1 doesn't necessarily mean deprecating the uAPI.
We created a type1v2 with minor semantic differences in unmap behavior
within the same uAPI.  Userspace is able to query and select an IOMMU
backend model and each model might have a different uAPI.  The SPAPR
IOMMU backend already takes advantage of this, using some ioctls
consistent with type1, but also requiring some extra steps.
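
To illustrate the query-and-select step: on a real container fd
userspace probes each model with the VFIO_CHECK_EXTENSION ioctl and
then commits with VFIO_SET_IOMMU. The sketch below only models the
selection policy; the enum values and the bitmask are local to this
example and are not the constants from <linux/vfio.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for backend models; values are arbitrary. */
enum backend {
    BACKEND_NONE = 0,
    BACKEND_TYPE1,
    BACKEND_TYPE1V2,
    BACKEND_SPAPR_TCE
};

#define BACKEND_BIT(b) (1u << (b))

/*
 * 'supported' is a bitmask the caller would build from its probe
 * results. Return the first preferred backend that is supported,
 * or BACKEND_NONE if none is.
 */
static enum backend pick_backend(unsigned int supported,
                                 const enum backend *prefs, size_t n)
{
    assert(prefs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (supported & BACKEND_BIT(prefs[i]))
            return prefs[i];
    }
    return BACKEND_NONE;
}
```

A caller that prefers type1v2 but accepts type1 would pass
{ BACKEND_TYPE1V2, BACKEND_TYPE1 } as the preference list.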

Also note that the simple MAP and UNMAP uAPI of type1 has its
limitations, which we already struggle with.  See for example the
massive performance issues backing a vIOMMU with this uAPI.  The
/dev/ioasid approach should alleviate some of that, using a page table
for the 1st level, but a more advanced uAPI for the 2nd level seems
necessary at some point as well.

> In this sense /dev/ioasid would be a container that holds multiple
> IOASIDs and every new format ioctl specifies the IOASID to operate
> on. The legacy ioctls would use some default IOASID but otherwise act
> the same.
> 
> I'm assuming here there is nothing especially wrong with the /dev/vfio
> interface beyond being in the wrong place in the kernel and not
> supporting multiple IOASIDs?
> 
> Then there may be a fairly simple approach to just make /dev/vfio ==
> /dev/ioasid, at least for type 1.
>
> By this I mean we could have the new /dev/ioasid code take over the
> /dev/vfio char dev and present both interfaces, but with the same
> fops.

That's essentially replacing vfio-core, where I think we're more
talking about /dev/ioasid being an available IOMMU backend driver which
a user can select when available.  The interface of making that
selection might change to accept an external /dev/ioasid file
descriptor, of course.  Maybe you can elaborate on how the vfio device
and group uAPI live (or not) in this new scheme where /dev/ioasid is the
primary interface.  Thanks,

Alex



* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 16:54                                                       ` Alex Williamson
@ 2021-04-21 17:52                                                         ` Jason Gunthorpe
  2021-04-21 19:33                                                           ` Alex Williamson
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-21 17:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Wed, Apr 21, 2021 at 10:54:51AM -0600, Alex Williamson wrote:

> That's essentially replacing vfio-core, where I think we're more

I am only talking about /dev/vfio here which is basically the IOMMU
interface part.

I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
/dev/{ioasid,vfio} to the VFIO group and all the group and device
logic stays inside VFIO.

The appeal of unifying /dev/{ioasid,vfio} to a single fops is that it
cleans up vfio a lot - we don't have to have two different code paths
where one handles a vfio_container and the other an ioasid_container
and all the related different iommu ops and so on.

Everything can be switched to ioasid_container all down the line. If
it wasn't for PPC this looks fairly simple.

Since getting rid of PPC looks a bit hard, we'd be stuck with
accepting a /dev/ioasid and then immediately wrapping it in a
vfio_container and shimming it through a vfio_iommu_ops. It is not
ideal at all, but in my look around I don't see a major problem if
type1 implementation is moved to live under /dev/ioasid.

For concreteness if we look at the set container flow with ioasid I'd
say something like:

vfio_group_fops_unl_ioctl()
 VFIO_GROUP_SET_CONTAINER
  vfio_group_set_container()
     if (f.file->f_op == &vfio_fops) {
          // Use a real vfio_container and vfio_iommu_driver
          driver->ops->attach_group()
             tce_iommu_attach_group()
     }

     if (ioasid_container = ioasid_get_from_fd(container_fd)) {
         // create a dummy vfio_container and use the ioasid driver
         container = kzalloc()
         container->iommu_driver = ioasid_shim
         driver->ops->attach_group()
             ioasid_shim_attach_group(ioasid_container, ...)
                 ioasid_attach_group()
                     // What used to be vfio_iommu_attach_group()

Broadly all the ops vfio needs go through the ioasid_shim which relays
them to the generic ioasid API.
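
As a rough illustration of such a shim, assuming a much simplified ops
table (the struct layout and signatures below are hypothetical and do
not match the real vfio_iommu_driver_ops):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic-side object. */
struct ioasid_container {
    int attached_groups;
};

/* Generic ioasid API: roughly what the type1 attach logic would become. */
static int ioasid_attach_group(struct ioasid_container *ic, int group_id)
{
    assert(ic != NULL);
    if (group_id < 0)
        return -1;
    ic->attached_groups++;
    return 0;
}

static void ioasid_detach_group(struct ioasid_container *ic, int group_id)
{
    assert(ic != NULL);
    (void)group_id;
    if (ic->attached_groups > 0)
        ic->attached_groups--;
}

/* Simplified driver ops table as the vfio side would see it. */
struct iommu_driver_ops {
    int  (*attach_group)(void *iommu_data, int group_id);
    void (*detach_group)(void *iommu_data, int group_id);
};

/* The shim only converts the opaque pointer and relays the call. */
static int ioasid_shim_attach_group(void *iommu_data, int group_id)
{
    return ioasid_attach_group(iommu_data, group_id);
}

static void ioasid_shim_detach_group(void *iommu_data, int group_id)
{
    ioasid_detach_group(iommu_data, group_id);
}

static const struct iommu_driver_ops ioasid_shim = {
    .attach_group = ioasid_shim_attach_group,
    .detach_group = ioasid_shim_detach_group,
};
```

The shim carries no state of its own, so the vfio side keeps calling
through driver->ops while all of the logic lives behind the ioasid_*
entry points.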

We end up with an ioasid.h that basically has the vfio_iommu_type1 code
lightly recast into some 'struct iommu_container' and a set of
ioasid_* function entry points that follow vfio_iommu_driver_ops_type1:
  ioasid_attach_group
  ioasid_detach_group
  ioasid_<something about user pages>
  ioasid_read/ioasid_write

If we have this, and /dev/ioasid implements the legacy IOCTLs, then
/dev/vfio == /dev/ioasid and we can compile out vfio_fops and related
from vfio.c and tell ioasid.c to create /dev/vfio instead using the
ops it owns.

This is a very long winded way of saying ideally we'd do
approximately:
  git mv drivers/vfio/vfio_iommu_type1.c drivers/ioasid/ioasid.c

As the first step. Essentially we declare that what is type1 is really
the user interface to the internal kernel IOMMU kAPI, which has been
steadily evolving since type1 was created 10 years ago.

> The interface of making that selection might change to accept an
> external /dev/ioasid file descriptor, of course.  Maybe you can
> elaborate on how the vfio device and group uAPI live (or not) in
> this new scheme where /dev/ioasid is the primary interface.  Thanks,

They stay in vfio. You'd still open a group and you'd still pass in
either /dev/vfio or /dev/ioasid to define the container.

Though, completely as an unrelated aside, I admit to not entirely
understanding why the group is the central element of the uAPI.

It is weird that the vfio "drivers" all work on the struct vfio_device
(at least after my series), and it has a file_operations presence via
vfio_device_fops, but instead of struct vfio_device directly having a
'struct device' and cdev to access the FD we get it through a group FD
and a group chardev via VFIO_GROUP_GET_DEVICE_FD.

If we were to revise this, and I don't see a huge reason to do so, I
would put a struct device and cdev in struct vfio_device, attach the
vfio_device directly to the ioasid and then forget about the group, at
least as uapi, completely.

Or at least I don't see where that gets into trouble, but I'm not too
familiar with the multi-vfio in a process scenario..

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 17:52                                                         ` Jason Gunthorpe
@ 2021-04-21 19:33                                                           ` Alex Williamson
  2021-04-21 23:03                                                             ` Jason Gunthorpe
  2021-04-22 12:55                                                             ` Liu Yi L
  0 siblings, 2 replies; 269+ messages in thread
From: Alex Williamson @ 2021-04-21 19:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Wed, 21 Apr 2021 14:52:03 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 21, 2021 at 10:54:51AM -0600, Alex Williamson wrote:
> 
> > That's essentially replacing vfio-core, where I think we're more  
> 
> I am only talking about /dev/vfio here which is basically the IOMMU
> interface part.
> 
> I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> /dev/{ioasid,vfio} to the VFIO group and all the group and device
> logic stays inside VFIO.

But that group and device logic is also tied to the container, where
the IOMMU backend is the interchangeable thing that provides the IOMMU
manipulation for that container.  If you're using
VFIO_GROUP_SET_CONTAINER to associate a group to a /dev/ioasid, then
you're really either taking that group outside of vfio or you're
re-implementing group management in /dev/ioasid.  I'd expect the
transition point at VFIO_SET_IOMMU.

> The appeal of unifying /dev/{ioasid,vfio} to a single fops is that it
> cleans up vfio a lot - we don't have to have two different code paths
> where one handles a vfio_container and the other an ioasid_container
> and all the related different iommu ops and so on.

Currently vfio IOMMU backends don't know about containers either.
Setting the vfio IOMMU for a container creates an object within the
IOMMU backend representing that IOMMU context.  IOMMU groups are then
attached to that context, where the IOMMU backend can add to or create a
new IOMMU domain to include that group, or if no compatible IOMMU
context can be created, reject it.
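
A toy model of that attach decision, with hypothetical names and a
capability bitmask standing in for real IOMMU compatibility checks:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DOMAINS 4

/* Hypothetical: one entry per IOMMU domain inside the backend context. */
struct domain {
    unsigned int caps;       /* capabilities of the IOMMU behind it */
    int          nr_groups;
};

struct iommu_ctx {
    struct domain domains[MAX_DOMAINS];
    size_t        nr_domains;
    unsigned int  required_caps;  /* what this context must provide */
};

static void ctx_init(struct iommu_ctx *ctx, unsigned int required_caps)
{
    memset(ctx, 0, sizeof(*ctx));
    ctx->required_caps = required_caps;
}

/* Returns the index of the domain the group joined, or -1 if rejected. */
static int ctx_attach_group(struct iommu_ctx *ctx, unsigned int group_caps)
{
    size_t i;

    assert(ctx != NULL);

    /* No compatible context can be created: reject the group. */
    if ((group_caps & ctx->required_caps) != ctx->required_caps)
        return -1;

    /* Add the group to an existing compatible domain if there is one. */
    for (i = 0; i < ctx->nr_domains; i++) {
        if (ctx->domains[i].caps == group_caps) {
            ctx->domains[i].nr_groups++;
            return (int)i;
        }
    }

    /* Otherwise create a new domain for it, if there is room. */
    if (ctx->nr_domains == MAX_DOMAINS)
        return -1;
    i = ctx->nr_domains;
    ctx->domains[i].caps = group_caps;
    ctx->domains[i].nr_groups = 1;
    ctx->nr_domains++;
    return (int)i;
}
```

The three outcomes (join an existing domain, create a new one, reject)
map onto the three cases described above.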

> Everything can be switched to ioasid_container all down the line. If
> it wasn't for PPC this looks fairly simple.

At what point is it no longer vfio?  I'd venture to say that replacing
the container rather than invoking a different IOMMU backend is that
point.

> Since getting rid of PPC looks a bit hard, we'd be stuck with
> accepting a /dev/ioasid and then immediately wrapping it in a
> vfio_container and shimming it through a vfio_iommu_ops. It is not
> ideal at all, but in my look around I don't see a major problem if
> type1 implementation is moved to live under /dev/ioasid.

But type1 is \just\ an IOMMU backend, not "/dev/vfio".  Given that
nobody flinched at removing NVLink support, maybe just deprecate SPAPR
now and see if anyone objects ;)

> For concreteness if we look at the set container flow with ioasid I'd
> say something like:
> 
> vfio_group_fops_unl_ioctl()
>  VFIO_GROUP_SET_CONTAINER
>   vfio_group_set_container()
>      if (f.file->f_op == &vfio_fops) {
>           // Use a real vfio_container and vfio_iommu_driver
>           driver->ops->attach_group()
>              tce_iommu_attach_group()
>      }
> 
>      if (ioasid_container = ioasid_get_from_fd(container_fd)) {
>          // create a dummy vfio_container and use the ioasid driver
>          container = kzalloc()
>          container->iommu_driver = ioasid_shim
>          driver->ops->attach_group()
>              ioasid_shim_attach_group(ioasid_container, ...)
>                  ioasid_attach_group()
>                      // What used to be vfio_iommu_attach_group()

How do you handle multiple groups with the same container?  Again, I'd
expect some augmentation of VFIO_SET_IOMMU so that /dev/vfio continues
to exist and manage group to container mapping and /dev/ioasid manages
the IOMMU context of that container.
> 
> Broadly all the ops vfio need go through the ioasid_shim which relays
> them to the generic ioasid API.

/dev/vfio essentially already passes through all fops to the IOMMU
backend once the VFIO_SET_IOMMU is established.
 
> We end up with an ioasid.h that basically has the vfio_iommu_type1 code
> lightly recast into some 'struct iommu_container' and a set of
> ioasid_* function entry points that follow vfio_iommu_driver_ops_type1:
>   ioasid_attach_group
>   ioasid_detach_group
>   ioasid_<something about user pages>
>   ioasid_read/ioasid_write

Again, this looks like a vfio IOMMU backend.  What are we accomplishing
by replacing /dev/vfio with /dev/ioasid versus some manipulation of
VFIO_SET_IOMMU accepting a /dev/ioasid fd?

> If we have this, and /dev/ioasid implements the legacy IOCTLs, then
> /dev/vfio == /dev/ioasid and we can compile out vfio_fops and related
> from vfio.c and tell ioasid.c to create /dev/vfio instead using the
> ops it owns.

Why would we want /dev/ioasid to implement legacy ioctls instead of
simply implementing an interface to allow /dev/ioasid to be used as a
vfio IOMMU backend?
 
> This is a very long winded way of saying ideally we'd do
> approximately:
>   git mv drivers/vfio/vfio_iommu_type1.c drivers/ioasid/ioasid.c
> 
> As the first step. Essentially we declare that what is type1 is really
> the user interface to the internal kernel IOMMU kAPI, which has been
> steadily evolving since type1 was created 10 years ago.

The pseudo code above really suggests you do want to remove
/dev/vfio/vfio, but this is only one of the IOMMU backends for vfio, so
I can't quite figure out if we're talking past each other.

As I expressed in another thread, type1 has a lot of shortcomings.  The
mapping interface leaves userspace trying desperately to use statically
mapped buffers because the map/unmap latency is too high.  We have
horrible issues with duplicate locked page accounting across
containers.  It suffers pretty hard from feature creep in various
areas.  A new IOMMU backend is an opportunity to redesign some of these
things.

> > The interface of making that selection might change to accept an
> > external /dev/ioasid file descriptor, of course.  Maybe you can
> > elaborate on how the vfio device and group uAPI live (or not) in
> > this new scheme where /dev/ioasid is the primary interface.  Thanks,  
> 
> They stay in vfio. You'd still open a group and you'd still pass in
> either /dev/vfio or /dev/ioasid to define the container.
> 
> Though, completely as an unrelated aside, I admit to not entirely
> understanding why the group is the central element of the uAPI.
> 
> It is weird that the vfio "drivers" all work on the struct vfio_device
> (at least after my series), and it has a file_operations presence via
> vfio_device_fops, but instead of struct vfio_device directly having a
> 'struct device' and cdev to access the FD we get it through a group FD
> and a group chardev via VFIO_GROUP_GET_DEVICE_FD.
> 
> If we were to revise this, and I don't see a huge reason to do so, I
> would put a struct device and cdev in struct vfio_device, attach the
> vfio_device directly to the ioasid and then forget about the group, at
> least as uapi, completely.
> 
> Or at least I don't see where that gets into trouble, but I'm not too
> familiar with the multi-vfio in a process scenario..

The vfio_group is the unit of userspace ownership as it reflects the
IOMMU group as the unit of isolation.  Ideally there's a 1:1 mapping
between device and group, but that is of course not always the case.

The IOMMU group also abstracts isolation and visibility relative to
DMA.  For example, in a PCIe topology a multi-function device may not
have isolation between functions, but each requester ID is visible to
the IOMMU.  This lacks isolation but not IOMMU granularity, or
visibility.  A conventional PCI topology however lacks both isolation
and visibility, all devices downstream use either the PCIe-to-PCI
bridge RID or a RID derived from the secondary bus.  We can also have
mixed topologies, for example PCIe-to-PCI<->PCI-to-PCIe, where the
grouping code needs to search upstream for the highest level where we
achieve both isolation and visibility.

To simplify this, we use the group as the unit of IOMMU context, again
favoring singleton group behavior.

An individual vfio_device doesn't know about these isolation
dependencies, thus while a vfio bus/device driver like vfio-pci can
expose a device, it's vfio-core that manages whether the isolated set
of devices which includes that device, ie. the group, meets the
requirements for userspace access.  Thanks,

Alex



* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 19:33                                                           ` Alex Williamson
@ 2021-04-21 23:03                                                             ` Jason Gunthorpe
  2021-04-22  8:34                                                               ` Tian, Kevin
  2021-04-22 17:13                                                               ` Alex Williamson
  2021-04-22 12:55                                                             ` Liu Yi L
  1 sibling, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-21 23:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave

On Wed, Apr 21, 2021 at 01:33:12PM -0600, Alex Williamson wrote:

> > I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> > /dev/{ioasid,vfio} to the VFIO group and all the group and device
> > logic stays inside VFIO.
> 
> But that group and device logic is also tied to the container, where
> the IOMMU backend is the interchangeable thing that provides the IOMMU
> manipulation for that container.

I think that is an area where the discussion would need to be focused.

I don't feel very prepared to have it in detail, as I haven't dug
into all the group and iommu micro-operations very much.

But, it does seem like the security concept that VFIO is creating with
the group also has to be present in the lower iommu layer too.

With different subsystems joining devices to the same ioasids we
still have to enforce the security property the vfio group is creating.

> If you're using VFIO_GROUP_SET_CONTAINER to associate a group to a
> /dev/ioasid, then you're really either taking that group outside of
> vfio or you're re-implementing group management in /dev/ioasid. 

This sounds right.

> > Everything can be switched to ioasid_container all down the line. If
> > it wasn't for PPC this looks fairly simple.
> 
> At what point is it no longer vfio?  I'd venture to say that replacing
> the container rather than invoking a different IOMMU backend is that
> point.

sorry, which is no longer vfio?
 
> > Since getting rid of PPC looks a bit hard, we'd be stuck with
> > accepting a /dev/ioasid and then immediately wrapping it in a
> > vfio_container and shimming it through a vfio_iommu_ops. It is not
> > ideal at all, but in my look around I don't see a major problem if
> > type1 implementation is moved to live under /dev/ioasid.
> 
> But type1 is \just\ an IOMMU backend, not "/dev/vfio".  Given that
> nobody flinched at removing NVLink support, maybe just deprecate SPAPR
> now and see if anyone objects ;)

Would simplify this project, but I wonder :)

In any event, it does look like today we'd expect the SPAPR stuff
would be done through the normal iommu APIs, perhaps enhanced a bit,
which makes me suspect an enhanced type1 can implement SPAPR.

I say this because the SPAPR looks quite a lot like PASID when it has
APIs for allocating multiple tables and other things. I would be
interested to hear someone from IBM talk about what it is doing and
how it doesn't fit into today's IOMMU API.

It is very old and the iommu world has advanced tremendously lately,
maybe I'm too optimistic?

> > We end up with an ioasid.h that basically has the vfio_iommu_type1 code
> > lightly recast into some 'struct iommu_container' and a set of
> > ioasid_* function entry points that follow vfio_iommu_driver_ops_type1:
> >   ioasid_attach_group
> >   ioasid_detach_group
> >   ioasid_<something about user pages>
> >   ioasid_read/ioasid_write
> 
> Again, this looks like a vfio IOMMU backend.  What are we accomplishing
> by replacing /dev/vfio with /dev/ioasid versus some manipulation of
> VFIO_SET_IOMMU accepting a /dev/ioasid fd?

The point of all of this is to make the user api for the IOMMU
cross-subsystem. It is not a vfio IOMMU backend, it is moving the
IOMMU abstraction from VFIO into the iommu framework and giving the
iommu framework a re-usable user API.

My ideal outcome would be for VFIO to use only the new iommu/ioasid
API and have no iommu pluggability at all. The iommu subsystem
provides everything needed to VFIO, and provides it equally to VDPA
and everything else.

drivers/vfio/ becomes primarily about 'struct vfio_device' and
everything related to its IOCTL interface.

drivers/iommu and ioasid.c become all about a pluggable IOMMU
interface, including a uAPI for it.

IMHO it makes sense at a high level, though it may be a pipe dream.

> > If we have this, and /dev/ioasid implements the legacy IOCTLs, then
> > /dev/vfio == /dev/ioasid and we can compile out vfio_fops and related
> > from vfio.c and tell ioasid.c to create /dev/vfio instead using the
> > ops it owns.
> 
> Why would we want /dev/ioasid to implement legacy ioctls instead of
> simply implementing an interface to allow /dev/ioasid to be used as a
> vfio IOMMU backend?

Only to make our own migration easier. I'd imagine everyone would want
to sit down and design this new clear ioasid API that can co-exist on
/dev/ioasid with the legacy ones.

> The pseudo code above really suggests you do want to remove
> /dev/vfio/vfio, but this is only one of the IOMMU backends for vfio, so
> I can't quite figure out if we're talking past each other.

I'm not quite sure what you mean by "one of the IOMMU backends?" You
mean type1, right?
 
> As I expressed in another thread, type1 has a lot of shortcomings.  The
> mapping interface leaves userspace trying desperately to use statically
> mapped buffers because the map/unmap latency is too high.  We have
> horrible issues with duplicate locked page accounting across
> containers.  It suffers pretty hard from feature creep in various
> areas.  A new IOMMU backend is an opportunity to redesign some of these
> things.

Sure, but also those kinds of transformational things go a lot better
if you can smoothly go from the old to the new and have technical
co-existence inside the kernel. Having a shim that maps the old APIs
to new APIs internally to Linux helps keep the implementation from
becoming too bogged down with compatibility.

> The IOMMU group also abstracts isolation and visibility relative to
> DMA.  For example, in a PCIe topology a multi-function device may not
> have isolation between functions, but each requester ID is visible to
> the IOMMU.  

Okay, I'm glad I have this all right in my head, as I was pretty sure
this was what the group was about.

My next question is why do we have three things as a FD: group, device
and container (aka IOMMU interface)?

Do we have container because the /dev/vfio/vfio can hold only a single
page table so we need to swap containers sometimes?

If we start from a clean sheet and make a sketch..

/dev/ioasid is the IOMMU control interface. It can create multiple
IOASIDs that have page tables and it can manipulate those page tables.
Each IOASID is identified by some number.

struct vfio_device/vdpa_device/etc are consumers of /dev/ioasid

When a device attaches to an ioasid userspace gives VFIO/VDPA the
ioasid FD and the ioasid # in the FD.

The security rule for isolation is that once a device is attached to a
/dev/ioasid fd then all other devices in that security group must be
attached to the same ioasid FD or left unused.

Thus /dev/ioasid also becomes the unit of security and the IOMMU
subsystem level becomes aware of and enforces the group security
rules. Userspace does not need to "see" the group

In sketch it would be like
  ioasid_fd = open("/dev/ioasid");
  vfio_device_fd = open("/dev/vfio/device0");
  vdpa_device_fd = open("/dev/vdpa/device0");
  ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd);
  ioctl(vdpa_device_fd, JOIN_IOASID_FD, ioasid_fd);

  gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..);
  ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..);

  ioctl(vfio_device_fd, ATTACH_IOASID, gpa_ioasid_id);
  ioctl(vdpa_device_fd, ATTACH_IOASID, gpa_ioasid_id);

  .. both VDPA and VFIO see the guest physical map and the kernel has
     enough info that both could use the same IOMMU page table
     structure ..

  // Guest viommu turns off bypass mode for the vfio device
  ioctl(vfio_device_fd, DETACH_IOASID);

  // Guest viommu creates a new page table
  rid_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..);
  ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..);

  // Guest viommu links the new page table to the RID
  ioctl(vfio_device_fd, ATTACH_IOASID, rid_ioasid_id);

The group security concept becomes implicit and hidden from the
uAPI. JOIN_IOASID_FD implicitly finds the device's group inside the
kernel and requires that all members of the group be joined only to
this ioasid_fd.

Essentially we discover the group from the device instead of the
device from the group.
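
The implicit group rule could be modelled roughly as below. This is a
sketch under stated assumptions: the structs and join_ioasid_fd() are
hypothetical, and a plain integer stands in for the ioasid FD.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Hypothetical stand-in for an IOMMU group: it remembers which
 * ioasid fd, if any, its members have been joined to. */
struct iommu_group_model {
    int owner_fd;
    int nr_joined;
};

struct device_model {
    struct iommu_group_model *group;
    int joined_fd;
};

/*
 * JOIN_IOASID_FD as seen from the device: the group is found through
 * the device, and the join succeeds only if no other member of that
 * group is already joined to a different ioasid fd.
 */
static int join_ioasid_fd(struct device_model *dev, int ioasid_fd)
{
    struct iommu_group_model *grp;

    assert(dev != NULL && dev->group != NULL);
    grp = dev->group;

    if (dev->joined_fd != NO_OWNER)
        return -1;      /* device is already joined somewhere */
    if (grp->owner_fd != NO_OWNER && grp->owner_fd != ioasid_fd)
        return -1;      /* group belongs to another ioasid fd */

    grp->owner_fd = ioasid_fd;
    grp->nr_joined++;
    dev->joined_fd = ioasid_fd;
    return 0;
}
```

Userspace never names the group here; it only sees a join fail when a
sibling device in the same group is already owned by another fd.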

Where does it fall down compared to the three FD version we have
today?

Jason


* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 23:03                                                             ` Jason Gunthorpe
@ 2021-04-22  8:34                                                               ` Tian, Kevin
  2021-04-22 12:10                                                                 ` Jason Gunthorpe
  2021-04-22 17:13                                                               ` Alex Williamson
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-22  8:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 22, 2021 7:03 AM
> 
> > The pseudo code above really suggests you do want to remove
> > /dev/vfio/vfio, but this is only one of the IOMMU backends for vfio, so
> > I can't quite figure out if we're talking past each other.
> 
> I'm not quite sure what you mean by "one of the IOMMU backends?" You
> mean type1, right?

I think Alex meant that type1 is one of the IOMMU backends in VFIO (type1,
type1v2, tce, tce_v2, noiommu, etc.) which are all configured through
/dev/vfio/vfio. If we are just moving type1 to /dev/ioasid, that alone is
not sufficient justification for replacing /dev/vfio/vfio with /dev/ioasid,
at least in this transition phase (before all iommu bits are consolidated
in /dev/ioasid in your ideal outcome).

> 
> > As I expressed in another thread, type1 has a lot of shortcomings.  The
> > mapping interface leaves userspace trying desperately to use statically
> > mapped buffers because the map/unmap latency is too high.  We have
> > horrible issues with duplicate locked page accounting across
> > containers.  It suffers pretty hard from feature creep in various
> > areas.  A new IOMMU backend is an opportunity to redesign some of these
> > things.
> 
> Sure, but also those kinds of transformational things go a lot better
> if you can smoothly go from the old to the new and have technical
> co-existence inside the kernel. Having a shim that maps the old APIs
> to new APIs internally to Linux helps keep the implementation from
> becoming too bogged down with compatibility.

The shim layer could be considered a new iommu backend in VFIO,
which connects VFIO iommu ops to the internal helpers in drivers/ioasid.
In this case we don't need to replicate the VFIO uAPI through
/dev/ioasid. Instead the new interface just supports the new uAPI. An old
VFIO userspace still opens /dev/vfio/vfio to conduct iommu operations,
which implicitly go to drivers/ioasid. A new VFIO userspace uses
/dev/vfio/vfio to join ioasid_fd and then uses new uAPIs through
/dev/ioasid to manage iommu pgtables, as you described below.

> 
> > The IOMMU group also abstracts isolation and visibility relative to
> > DMA.  For example, in a PCIe topology a multi-function device may not
> > have isolation between functions, but each requester ID is visible to
> > the IOMMU.
> 
> Okay, I'm glad I have this all right in my head, as I was pretty sure
> this was what the group was about.
> 
> My next question is why do we have three things as a FD: group, device
> and container (aka IOMMU interface)?
> 
> Do we have container because the /dev/vfio/vfio can hold only a single
> page table so we need to swap containers sometimes?

Yes, one container can hold only a single page table. When vIOMMU is
exposed, VFIO requires each device/group to be in a different container to
support a per-device address space (before nested translation is supported),
which is switched between GPA and gIOVA when bypass mode is turned
on/off for a given device.

Another tricky thing is that a container may be linked to multiple iommu
domains in VFIO, as devices in the container may sit behind different
IOMMUs with inconsistent capabilities (commit 1ef3e2bc). In this case,
more accurately, one container can hold a single address space, which could
be replayed into multiple page tables (with exactly the same mappings). I'm not
sure whether this is something that could be simplified (or not supported)
in the new interface. In the end each pgtable operation is per iommu domain
in the iommu layer. I wonder where we want to maintain the relationship
between the ioasid_fd and associated iommu domains.
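
To make the "one address space, several page tables" point concrete,
here is a small model (all names hypothetical, fixed-size arrays in
place of real page tables): each map is applied to every domain, and a
domain added later gets the existing mappings replayed into it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DOMAINS 4
#define MAX_MAPS    16

struct mapping {
    uint64_t iova, paddr, size;
};

/* Stand-in for one iommu domain's page table. */
struct domain_pgtable {
    struct mapping maps[MAX_MAPS];
    size_t nr;
};

/* One address space whose mappings are mirrored into every domain. */
struct address_space {
    struct mapping maps[MAX_MAPS];
    size_t nr_maps;
    struct domain_pgtable domains[MAX_DOMAINS];
    size_t nr_domains;
};

static void as_init(struct address_space *as)
{
    memset(as, 0, sizeof(*as));
}

/* Record the mapping once, then apply it to every attached domain. */
static int as_map(struct address_space *as, struct mapping m)
{
    assert(as != NULL);
    if (as->nr_maps == MAX_MAPS)
        return -1;
    as->maps[as->nr_maps++] = m;
    for (size_t i = 0; i < as->nr_domains; i++) {
        struct domain_pgtable *d = &as->domains[i];
        d->maps[d->nr++] = m;
    }
    return 0;
}

/* A new domain starts empty, so replay all existing mappings into it.
 * Returns the new domain's index, or -1 if no slot is left. */
static int as_add_domain(struct address_space *as)
{
    struct domain_pgtable *d;
    size_t idx;

    assert(as != NULL);
    if (as->nr_domains == MAX_DOMAINS)
        return -1;
    idx = as->nr_domains;
    d = &as->domains[idx];
    d->nr = 0;
    for (size_t i = 0; i < as->nr_maps; i++)
        d->maps[d->nr++] = as->maps[i];
    as->nr_domains++;
    return (int)idx;
}
```

Keeping the canonical list in the address space, rather than in any one
domain, is what lets a late-joining domain catch up by replay.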

> 
> If we start from a clean sheet and make a sketch..
> 
> /dev/ioasid is the IOMMU control interface. It can create multiple
> IOASIDs that have page tables and it can manipulate those page tables.
> Each IOASID is identified by some number.
> 
> struct vfio_device/vdpa_device/etc are consumers of /dev/ioasid
> 
> When a device attaches to an ioasid userspace gives VFIO/VDPA the
> ioasid FD and the ioasid # in the FD.
> 
> The security rule for isolation is that once a device is attached to a
> /dev/ioasid fd then all other devices in that security group must be
> attached to the same ioasid FD or left unused.
> 
> Thus /dev/ioasid also becomes the unit of security and the IOMMU
> subsystem level becomes aware of and enforces the group security
> rules. Userspace does not need to "see" the group
> 
> In sketch it would be like
>   ioasid_fd = open("/dev/ioasid");
>   vfio_device_fd = open("/dev/vfio/device0")
>   vdpa_device_fd = open("/dev/vdpa/device0")
>   ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
>   ioctl(vdpa_device_fd, JOIN_IOASID_FD, ioasifd)
> 
>   gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
>   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> 
>   ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid_id)
>   ioctl(vpda_device, ATTACH_IOASID, gpa_ioasid_id)
> 
>   .. both VDPA and VFIO see the guest physical map and the kernel has
>      enough info that both could use the same IOMMU page table
>      structure ..
> 
>   // Guest viommu turns off bypass mode for the vfio device
>   ioctl(vfio_device, DETATCH_IOASID)
> 
>   // Guest viommu creates a new page table
>   rid_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
>   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> 
>   // Guest viommu links the new page table to the RID
>   ioctl(vfio_device, ATTACH_IOASID, rid_ioasid_id)

Just to confirm: the above flow is for the current map/unmap flavor, as
VFIO/vDPA do today. Later, when nested translation is supported,
there is no need to detach gpa_ioasid_id. Instead, a new cmd will
be introduced to nest rid_ioasid_id on top of gpa_ioasid_id:

ioctl(ioasid_fd, NEST_IOASIDS, rid_ioasid_id, gpa_ioasid_id);
ioctl(ioasid_fd, BIND_PGTABLE, rid_ioasid_id, ...);

and vSVA will follow the same flow:

ioctl(ioasid_fd, NEST_IOASIDS, sva_ioasid_id, gpa_ioasid_id);
ioctl(ioasid_fd, BIND_PGTABLE, sva_ioasid_id, ...);

Does this match what you have in mind for expanding /dev/ioasid to
support vSVA and other new usages?

> 
> The group security concept becomes implicit and hidden from the
> uAPI. JOIN_IOASID_FD implicitly finds the device's group inside the
> kernel and requires that all members of the group be joined only to
> this ioasid_fd.
> 
> Essentially we discover the group from the device instead of the
> device from the group.
> 
> Where does it fall down compared to the three FD version we have
> today?
> 

I also feel hiding the group from the uAPI is a good thing, and I am
interested in the rationale for explicitly managing the group in vfio (which
is essentially the same boundary as provided by the iommu group), e.g. for
better user experience when group security is broken?

Thanks
Kevin

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22  8:34                                                               ` Tian, Kevin
@ 2021-04-22 12:10                                                                 ` Jason Gunthorpe
  2021-04-23  9:06                                                                   ` Tian, Kevin
                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-22 12:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave

On Thu, Apr 22, 2021 at 08:34:32AM +0000, Tian, Kevin wrote:

> The shim layer could be considered as a new iommu backend in VFIO,
> which connects VFIO iommu ops to the internal helpers in
> drivers/ioasid.

It may be the best we can do because of SPAPR, but the ideal outcome
should be to remove the pluggable IOMMU stuff from vfio
entirely and have it only use /dev/ioasid

We should never add another pluggable IOMMU type to vfio - everything
should be done through drivers/iommu now that it is much more capable.

> Another tricky thing is that a container may be linked to multiple iommu
> domains in VFIO, as devices in the container may locate behind different
> IOMMUs with inconsistent capability (commit 1ef3e2bc). 

Frankly this sounds overcomplicated. I would think /dev/ioasid should
select the IOMMU when the first device is joined, and all future joins
must be compatible with the original IOMMU - ie there is only one set
of IOMMU capabilities in a /dev/ioasid.

This means qemu might have multiple /dev/ioasid's if the system has
multiple incompatible IOMMUs (is this actually a thing?) The platform
should design its IOMMU domains to minimize the number of
/dev/ioasid's required.
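
A minimal sketch of that "first join picks the IOMMU" rule, as a userspace
model in C. The compat_* names and capability bits are hypothetical, and
"compatible" is simplified here to "identical capability mask", which is an
assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits, for illustration only. */
#define COMPAT_CAP_NESTING  (1u << 0)
#define COMPAT_CAP_PASID    (1u << 1)
#define COMPAT_CAP_COHERENT (1u << 2)

/* Stand-in for the state behind one open /dev/ioasid fd. */
struct compat_ioasid_fd {
    bool has_iommu;           /* set once the first device has joined */
    uint32_t caps;            /* capabilities fixed by the first join */
    unsigned int nr_devices;
};

/* Returns 0 on success, -1 if the device's IOMMU does not match. */
static int compat_join(struct compat_ioasid_fd *ctx, uint32_t dev_iommu_caps)
{
    assert(ctx != NULL);
    if (!ctx->has_iommu) {
        /* First join: this device's IOMMU defines the fd. */
        ctx->has_iommu = true;
        ctx->caps = dev_iommu_caps;
    } else if (ctx->caps != dev_iommu_caps) {
        /* Later joins must be compatible with the original IOMMU. */
        return -1;
    }
    ctx->nr_devices++;
    return 0;
}
```

A VMM that gets -1 here would open a second /dev/ioasid for the
incompatible device, which is the "multiple /dev/ioasid's" case above.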

Is there a reason we need to share IOASIDs between completely
divergent IOMMU implementations? I don't expect the HW should be able
to physically share page tables??

That decision point alone might be the thing that just says we can't
ever have /dev/vfio/vfio == /dev/ioasid

> Just to confirm. Above flow is for current map/unmap flavor as what
> VFIO/vDPA do today. Later when nested translation is supported,
> there is no need to detach gpa_ioasid_fd. Instead, a new cmd will
> be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd:

Sure.. The tricky bit will be to define both of the common nested
operating modes.

  nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
  ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)

   // IOMMU will match on the device RID, no PASID:
  ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);

   // IOMMU will match on the device RID and PASID:
  ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);

Notice that ATTACH (or bind, whatever) is always done on the
vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF&PASID to
a specific page table defined by an IOASID.
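
The two attach flavours can be modelled as a small lookup table keyed by
(BDF, PASID). This is a sketch only: the rid_* names, the RID_NO_PASID
marker and the "re-attach without detach is rejected" behaviour are
assumptions for illustration, not a proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RID_NO_PASID   UINT32_MAX   /* hypothetical marker: match on RID only */
#define RID_MAX_ATTACH 8

struct rid_entry {
    uint16_t bdf;     /* PCI bus/device/function (requester ID) */
    uint32_t pasid;   /* PASID, or RID_NO_PASID */
    int ioasid;       /* page table selected for this (bdf, pasid) */
};

struct rid_table {
    struct rid_entry e[RID_MAX_ATTACH];
    size_t n;
};

/* Resolve a DMA request to an IOASID; -1 means nothing is attached. */
static int rid_lookup(const struct rid_table *t, uint16_t bdf, uint32_t pasid)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->n; i++)
        if (t->e[i].bdf == bdf && t->e[i].pasid == pasid)
            return t->e[i].ioasid;
    return -1;
}

/*
 * Models ATTACH_IOASID (pasid == RID_NO_PASID) and ATTACH_IOASID_PASID.
 * Returns 0 on success, -1 if the table is full or the slot is taken.
 */
static int rid_attach(struct rid_table *t, uint16_t bdf, uint32_t pasid,
                      int ioasid)
{
    if (ioasid < 0 || t->n == RID_MAX_ATTACH ||
        rid_lookup(t, bdf, pasid) >= 0)
        return -1;
    t->e[t->n++] = (struct rid_entry){ .bdf = bdf, .pasid = pasid,
                                       .ioasid = ioasid };
    return 0;
}
```

A request without a PASID is looked up with RID_NO_PASID, so the same
device can have one RID-only attachment plus any number of per-PASID ones.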

I expect we have many flavours of IOASID tables, eg we have normal,
and 'nested with table controlled by hypervisor'. ARM has 'nested with
table controlled by guest' right? So like this?

  nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID,
                   gpa_ioasid_id, <some kind of viommu_id>)
  // PASID now goes to <viommu_id>
  ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);

Where <viommu_id> is some guest-internal handle of the viommu
page table scoped within gpa_ioasid_id? Like maybe it is the GPA of the
base of the page table?

The guest can't select its own PASIDs without telling the hypervisor,
right?

> I also feel hiding group from uAPI is a good thing and is interested in
> the rationale behind for explicitly managing group in vfio (which is
> essentially the same boundary as provided by iommu group), e.g. for 
> better user experience when group security is broken? 

Indeed, I can see how things might have just evolved into this, but if
it has a purpose it seems pretty hidden.

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 19:33                                                           ` Alex Williamson
  2021-04-21 23:03                                                             ` Jason Gunthorpe
@ 2021-04-22 12:55                                                             ` Liu Yi L
  1 sibling, 0 replies; 269+ messages in thread
From: Liu Yi L @ 2021-04-22 12:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: yi.l.liu, Jason Gunthorpe, Jean-Philippe Brucker, Tian, Kevin,
	Jiang, Dave, Raj, Ashok, Jonathan Corbet, Jean-Philippe Brucker,
	Li Zefan, LKML, iommu, Johannes Weiner, Tejun Heo, cgroups, Wu,
	Hao, David Woodhouse

On Wed, 21 Apr 2021 13:33:12 -0600, Alex Williamson wrote:

> On Wed, 21 Apr 2021 14:52:03 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 21, 2021 at 10:54:51AM -0600, Alex Williamson wrote:
> >   
> > > That's essentially replacing vfio-core, where I think we're more    
> > 
> > I am only talking about /dev/vfio here which is basically the IOMMU
> > interface part.
> > 
> > I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> > /dev/{ioasid,vfio} to the VFIO group and all the group and device
> > logic stays inside VFIO.  
> 
> But that group and device logic is also tied to the container, where
> the IOMMU backend is the interchangeable thing that provides the IOMMU
> manipulation for that container.  If you're using
> VFIO_GROUP_SET_CONTAINER to associate a group to a /dev/ioasid, then
> you're really either taking that group outside of vfio or you're
> re-implementing group management in /dev/ioasid.  I'd expect the
> transition point at VFIO_SET_IOMMU.

Per my understanding, transitioning at the VFIO_SET_IOMMU point makes more
sense, as VFIO can still keep the group and device logic, which is the key
concept of group-granularity isolation for userspace direct access.

-- 
Regards,
Yi Liu

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-21 23:03                                                             ` Jason Gunthorpe
  2021-04-22  8:34                                                               ` Tian, Kevin
@ 2021-04-22 17:13                                                               ` Alex Williamson
  2021-04-22 17:57                                                                 ` Jason Gunthorpe
  2021-04-27  4:50                                                                 ` David Gibson
  1 sibling, 2 replies; 269+ messages in thread
From: Alex Williamson @ 2021-04-22 17:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Wed, 21 Apr 2021 20:03:01 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 21, 2021 at 01:33:12PM -0600, Alex Williamson wrote:
> 
> > > I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> > > /dev/{ioasid,vfio} to the VFIO group and all the group and device
> > > logic stays inside VFIO.  
> > 
> > But that group and device logic is also tied to the container, where
> > the IOMMU backend is the interchangeable thing that provides the IOMMU
> > manipulation for that container.  
> 
> I think that is an area where the discussion would need to be focused.
> 
> I don't feel very prepared to have it in details, as I haven't dug
> into all the group and iommu micro-operation very much.
> 
> But, it does seem like the security concept that VFIO is creating with
> the group also has to be present in the lower iommu layer too.
> 
> With different subsystems joining devices to the same ioasid's we
> still have to enforce the security propery the vfio group is creating.
> 
> > If you're using VFIO_GROUP_SET_CONTAINER to associate a group to a
> > /dev/ioasid, then you're really either taking that group outside of
> > vfio or you're re-implementing group management in /dev/ioasid.   
> 
> This sounds right.
> 
> > > Everything can be switched to ioasid_container all down the line. If
> > > it wasn't for PPC this looks fairly simple.  
> > 
> > At what point is it no longer vfio?  I'd venture to say that replacing
> > the container rather than invoking a different IOMMU backend is that
> > point.  
> 
> sorry, which is no longer vfio?

I'm suggesting that if we're replacing the container/group model with
an ioasid then we're effectively creating a new thing that really only
retains the vfio device uapi.

> > > Since getting rid of PPC looks a bit hard, we'd be stuck with
> > > accepting a /dev/ioasid and then immediately wrappering it in a
> > > vfio_container an shimming it through a vfio_iommu_ops. It is not
> > > ideal at all, but in my look around I don't see a major problem if
> > > type1 implementation is moved to live under /dev/ioasid.  
> > 
> > But type1 is \just\ an IOMMU backend, not "/dev/vfio".  Given that
> > nobody flinched at removing NVLink support, maybe just deprecate SPAPR
> > now and see if anyone objects ;)  
> 
> Would simplify this project, but I wonder :)
> 
> In any event, it does look like today we'd expect the SPAPR stuff
> would be done through the normal iommu APIs, perhaps enhanced a bit,
> which makes me suspect an enhanced type1 can implement SPAPR.

David Gibson has argued for some time that SPAPR could be handled via a
converged type1 model.  We had mapped that out at one point,
essentially a "type2", but neither of us had any bandwidth to pursue it.

> I say this because the SPAPR looks quite a lot like PASID when it has
> APIs for allocating multiple tables and other things. I would be
> interested to hear someone from IBM talk about what it is doing and
> how it doesn't fit into today's IOMMU API.

[Cc David, Alexey]

> It is very old and the iommu world has advanced tremendously lately,
> maybe I'm too optimisitic?
> 
> > > We end up with a ioasid.h that basically has the vfio_iommu_type1 code
> > > lightly recast into some 'struct iommu_container' and a set of
> > > ioasid_* function entry points that follow vfio_iommu_driver_ops_type1:
> > >   ioasid_attach_group
> > >   ioasid_detatch_group
> > >   ioasid_<something about user pages>
> > >   ioasid_read/ioasid_write  
> > 
> > Again, this looks like a vfio IOMMU backend.  What are we accomplishing
> > by replacing /dev/vfio with /dev/ioasid versus some manipulation of
> > VFIO_SET_IOMMU accepting a /dev/ioasid fd?  
> 
> The point of all of this is to make the user api for the IOMMU
> cross-subsystem. It is not a vfio IOMMU backend, it is moving the
> IOMMU abstraction from VFIO into the iommu framework and giving the
> iommu framework a re-usable user API.

Right, but I don't see that implies it cannot work within the vfio
IOMMU model.  Currently when an IOMMU is set, the /dev/vfio/vfio
container becomes a conduit for file ops from the container to be
forwarded to the IOMMU.  But that's in part because the user doesn't
have another object to interact with the IOMMU.  It's entirely possible
that with an ioasid shim, the user would continue to interact directly
with the /dev/ioasid fd for IOMMU manipulation and only use
VFIO_SET_IOMMU to associate a vfio container to that ioasid.

> My ideal outcome would be for VFIO to use only the new iommu/ioasid
> API and have no iommu pluggability at all. The iommu subsystem
> provides everything needed to VFIO, and provides it equally to VDPA
> and everything else.

As above, we don't necessarily need to have the vfio container be the
access mechanism for the IOMMU, it can become just a means to
associate the container with an IOMMU.  This has quite a few
transitional benefits.

> drivers/vfio/ becomes primarily about 'struct vfio_device' and
> everything related to its IOCTL interface.
> 
> drivers/iommu and ioasid.c become all about a pluggable IOMMU
> interface, including a uAPI for it.
> 
> IMHO it makes a high level sense, though it may be a pipe dream.

This is where we've dissolved all but the vfio device uapi, which
suggests the group and container model were never necessary and I'm not
sure exactly what that uapi looks like.  We currently make use of an
IOMMU api that is group aware, but that awareness extends out to the
vfio uapi.

> > > If we have this, and /dev/ioasid implements the legacy IOCTLs, then
> > > /dev/vfio == /dev/ioasid and we can compile out vfio_fops and related
> > > from vfio.c and tell ioasid.c to create /dev/vfio instead using the
> > > ops it owns.  
> > 
> > Why would we want /dev/ioasid to implement legacy ioctls instead of
> > simply implementing an interface to allow /dev/ioasid to be used as a
> > vfio IOMMU backend?  
> 
> Only to make our own migration easier. I'd imagine everyone would want
> to sit down and design this new clear ioasid API that can co-exist on
> /dev/ioasid with the legacy once.

vfio really just wants to be able to attach groups to an address space
to consider them isolated, everything else about the IOMMU API could
happen via a new ioasid file descriptor representing that context, ie.
vfio handles the group ownership and device access, ioasid handles the
actual mappings.

> > The pseudo code above really suggests you do want to remove
> > /dev/vfio/vfio, but this is only one of the IOMMU backends for vfio, so
> > I can't quite figure out if we're talking past each other.  
> 
> I'm not quite sure what you mean by "one of the IOMMU backends?" You
> mean type1, right?
>  
> > As I expressed in another thread, type1 has a lot of shortcomings.  The
> > mapping interface leaves userspace trying desperately to use statically
> > mapped buffers because the map/unmap latency is too high.  We have
> > horrible issues with duplicate locked page accounting across
> > containers.  It suffers pretty hard from feature creep in various
> > areas.  A new IOMMU backend is an opportunity to redesign some of these
> > things.  
> 
> Sure, but also those kinds of transformational things go alot better
> if you can smoothly go from the old to the new and have technical
> co-existance in side the kernel. Having a shim that maps the old APIs
> to new APIs internally to Linux helps keep the implementation from
> becoming too bogged down with compatibility.

I'm afraid /dev/ioasid providing type1 compatibility would be just that.

> > The IOMMU group also abstracts isolation and visibility relative to
> > DMA.  For example, in a PCIe topology a multi-function device may not
> > have isolation between functions, but each requester ID is visible to
> > the IOMMU.    
> 
> Okay, I'm glad I have this all right in my head, as I was pretty sure
> this was what the group was about.
> 
> My next question is why do we have three things as a FD: group, device
> and container (aka IOMMU interface)?
> 
> Do we have container because the /dev/vfio/vfio can hold only a single
> page table so we need to swap containers sometimes?

The container represents an IOMMU address space, which can be shared by
multiple groups, where each group may contain one or more devices.
Swapping a container would require releasing all the devices (the user
cannot have access to a non-isolated device), then a group could be
moved from one container to another.

> If we start from a clean sheet and make a sketch..
> 
> /dev/ioasid is the IOMMU control interface. It can create multiple
> IOASIDs that have page tables and it can manipulate those page tables.
> Each IOASID is identified by some number.
> 
> struct vfio_device/vdpa_device/etc are consumers of /dev/ioasid
> 
> When a device attaches to an ioasid userspace gives VFIO/VDPA the
> ioasid FD and the ioasid # in the FD.
> 
> The security rule for isolation is that once a device is attached to a
> /dev/ioasid fd then all other devices in that security group must be
> attached to the same ioasid FD or left unused.

Sounds like a group...  Note also that if those other devices are not
isolated from the user's device, the user could manipulate "unused"
devices via DMA.  So even unused devices should be within the same
IOMMU context... thus attaching groups to IOMMU domains.

> Thus /dev/ioasid also becomes the unit of security and the IOMMU
> subsystem level becomes aware of and enforces the group security
> rules. Userspace does not need to "see" the group

What tools does userspace have to understand isolation of individual
devices without groups?
 
> In sketch it would be like
>   ioasid_fd = open("/dev/ioasid");
>   vfio_device_fd = open("/dev/vfio/device0")
>   vdpa_device_fd = open("/dev/vdpa/device0")
>   ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
>   ioctl(vdpa_device_fd, JOIN_IOASID_FD, ioasifd)
> 
>   gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
>   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> 
>   ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid_id)
>   ioctl(vpda_device, ATTACH_IOASID, gpa_ioasid_id)
> 
>   .. both VDPA and VFIO see the guest physical map and the kernel has
>      enough info that both could use the same IOMMU page table
>      structure ..
> 
>   // Guest viommu turns off bypass mode for the vfio device
>   ioctl(vfio_device, DETATCH_IOASID)
>  
>   // Guest viommu creates a new page table
>   rid_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
>   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> 
>   // Guest viommu links the new page table to the RID
>   ioctl(vfio_device, ATTACH_IOASID, rid_ioasid_id)
> 
> The group security concept becomes implicit and hidden from the
> uAPI. JOIN_IOASID_FD implicitly finds the device's group inside the
> kernel and requires that all members of the group be joined only to
> this ioasid_fd.
> 
> Essentially we discover the group from the device instead of the
> device from the group.
> 
> Where does it fall down compared to the three FD version we have
> today?

The group concept is explicit today because how does userspace learn
about implicit dependencies between devices?  For example, if the user
has a conventional PCI bus with a couple devices on it, how do they
understand that those devices cannot be assigned to separate userspace
drivers?  The group fd fills that gap.  Thanks,

Alex


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 17:13                                                               ` Alex Williamson
@ 2021-04-22 17:57                                                                 ` Jason Gunthorpe
  2021-04-22 19:37                                                                   ` Alex Williamson
  2021-04-27  4:50                                                                 ` David Gibson
  1 sibling, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-22 17:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Thu, Apr 22, 2021 at 11:13:37AM -0600, Alex Williamson wrote:
> I'm suggesting that if we're replacing the container/group model with
> an ioasid then we're effectively creating a new thing that really only
> retains the vfio device uapi.

Yes, I think that is a fair assessment, but not necessarily bad.

The VFIO device uAPI is really the thing that is unique to VFIO and
cannot be re-used anyplace else, in my assessment this is what vfio
*is*, and the series I've been working on makes it more obvious how
broad that statement really is.

> > In any event, it does look like today we'd expect the SPAPR stuff
> > would be done through the normal iommu APIs, perhaps enhanced a bit,
> > which makes me suspect an enhanced type1 can implement SPAPR.
> 
> David Gibson has argued for some time that SPAPR could be handled via a
> converged type1 model.  We has mapped that out at one point,
> essentially a "type2", but neither of us had any bandwidth to pursue it.

Cool! Well, let's put a pin in it, I don't think revising SPAPR should
be a pre-condition for anything here - but we can all agree that an
ideal world would be able to access SPAPR functionality from
drivers/iommu and /dev/ioasid

And it would be nice to map this out enough to provide enough
preparation in the new /dev/ioasid uAPI. For instance I saw the only
SPAPR specific stuff in DPDK was to preset the IOVA range that the
IOASID would support. This would be trivial to add and may have
benefits to other IOMMUs by reducing the number of translation levels
or something.
 
> Right, but I don't see that implies it cannot work within the vfio
> IOMMU model.  Currently when an IOMMU is set, the /dev/vfio/vfio
> container becomes a conduit for file ops from the container to be
> forwarded to the IOMMU.  But that's in part because the user doesn't
> have another object to interact with the IOMMU.  It's entirely possible
> that with an ioasid shim, the user would continue to interact directly
> with the /dev/ioasid fd for IOMMU manipulation and only use
> VFIO_SET_IOMMU to associate a vfio container to that ioasid.

I am looking at this in two directions, the first is if we have
/dev/ioasid how do we connect it to VFIO? And here I argue we need new
device IOCTLs and ideally a VFIO world that does not have a vestigial
container FD at all.

This is because /dev/ioasid will have to be multi-IOASID and it just
does not fit well into the VFIO IOMMU pluggable model at all - or at
least trying to make it fit will defeat the point of having it in the
first place.

This does not seem to be a big deal - the required device IOCTLs
should be small and having two code paths isn't going to be an
exploding complexity.

The second direction is how do we keep the entire /dev/vfio/vfio uAPI
without duplicating a lot of code. This is where building an ioasid
back end or making ioasid == vfio are areas to look at.

> vfio really just wants to be able to attach groups to an address space
> to consider them isolated, everything else about the IOMMU API could
> happen via a new ioasid file descriptor representing that context, ie.
> vfio handles the group ownership and device access, ioasid handles the
> actual mappings.

Right, exactly.
 
> > Do we have container because the /dev/vfio/vfio can hold only a single
> > page table so we need to swap containers sometimes?
> 
> The container represents an IOMMU address space, which can be shared by
> multiple groups, where each group may contain one or more devices.
> Swapping a container would require releasing all the devices (the user
> cannot have access to a non-isolated device), then a group could be
> moved from one container to another.

So, basically, the answer is yes.

Having the container FD hold a single IOASID forced the group FD to
exist because we can't maintain the security object of a group in the
container FD if the work flow is to swap the container FD.

Here what I suggest is to merge the group security and the multiple
"IOMMU address space" concept into one FD. The /dev/ioasid would have
multiple page table objects within it called IOASIDs and each IOASID
effectively represents what /dev/vfio/vfio does today.

We can assign any device joined to a /dev/ioasid FD to any IOASID inside
that FD, dynamically.

> > The security rule for isolation is that once a device is attached to a
> > /dev/ioasid fd then all other devices in that security group must be
> > attached to the same ioasid FD or left unused.
> 
> Sounds like a group...  Note also that if those other devices are not
> isolated from the user's device, the user could manipulate "unused"
> devices via DMA.  So even unused devices should be within the same
> IOMMU context... thus attaching groups to IOMMU domains.

That is a very interesting point. So, say, in the classic PCI bus
world if I have a NIC and HD on my PCI bus and both are in the group,
I assign the NIC to a /dev/ioasid & VFIO then it is possible to use
the NIC to access the HD via DMA

And here you want a more explicit statement that the HD is at risk by
using the NIC?

Honestly, I'm not sure the current group FD is actually showing that
very strongly - though I get the point it is modeled in the sysfs and
kind of implicit in the API - we evolved things in a way where most
actual applications are taking in a PCI BDF from the user, not a group
reference. So the actual security impact seems lost on the user.

Along my sketch if we have:

   ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd)
   [..]
   ioctl(vfio_device_fd, ATTACH_IOASID, gpa_ioasid_id) == EPERM

I would feel comfortable if the ATTACH_IOASID fails by default if all
devices in the group have not been joined to the same ioasid_fd.

So in the NIC&HD example the application would need to do:

   ioasid_fd = open("/dev/ioasid");
   nic_device_fd = open("/dev/vfio/device0")
   hd_device_fd = open("/dev/vfio/device1")

   ioctl(nic_device_fd, JOIN_IOASID_FD, ioasid_fd)
   ioctl(hd_device_fd, JOIN_IOASID_FD, ioasid_fd)
   [..]
   ioctl(nic_device_fd, ATTACH_IOASID, gpa_ioasid_id) == SUCCESS

Now the security relation is forced by the kernel to be very explicit.

However to keep current semantics, I'd suggest a flag on
JOIN_IOASID_FD called "IOASID_IMPLICIT_GROUP" which has the effect of
allowing the ATTACH_IOASID to succeed without the user having to
explicitly join all the group devices. This is analogous to the world
we have today of opening the VFIO group FD but only instantiating one
device FD.
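
The attach rule and the proposed flag can be written down as a small
userspace model in C. This is a sketch, not kernel code: the grp_* names
and GRP_JOIN_IMPLICIT_GROUP are hypothetical stand-ins for JOIN_IOASID_FD,
ATTACH_IOASID and IOASID_IMPLICIT_GROUP. It also assumes that, even with
the flag, a group member joined to a *different* ioasid fd still blocks the
attach, since the rule is "same ioasid FD or left unused".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define GRP_JOIN_IMPLICIT_GROUP (1u << 0)   /* hypothetical flag */

struct grp_device {
    int group;            /* iommu group the device belongs to */
    int joined_fd;        /* ioasid fd the device joined, or -1 if unused */
    bool implicit_group;  /* joined with GRP_JOIN_IMPLICIT_GROUP */
};

/* Models JOIN_IOASID_FD on a device fd. */
static void grp_join(struct grp_device *dev, int ioasid_fd, unsigned int flags)
{
    dev->joined_fd = ioasid_fd;
    dev->implicit_group = (flags & GRP_JOIN_IMPLICIT_GROUP) != 0;
}

/* Models ATTACH_IOASID on devs[idx]; returns 0 or -EPERM. */
static int grp_attach(const struct grp_device *devs, size_t n, size_t idx)
{
    assert(idx < n);
    const struct grp_device *dev = &devs[idx];

    if (dev->joined_fd < 0)
        return -EPERM;              /* must join before attaching */

    for (size_t i = 0; i < n; i++) {
        if (devs[i].group != dev->group)
            continue;
        if (devs[i].joined_fd == dev->joined_fd)
            continue;               /* same fd: fine */
        if (devs[i].joined_fd < 0 && dev->implicit_group)
            continue;               /* unused member tolerated by the flag */
        return -EPERM;              /* group not fully under this fd */
    }
    return 0;
}
```

For the NIC and HD example: attaching the NIC fails until the HD has also
joined the same fd, unless the NIC was joined with the implicit-group flag.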

In effect the ioasid FD becomes the group and the numbered IOASID's
inside the FD become the /dev/vfio/vfio objects - we don't end up with
fewer objects in the system, they just have different uAPI
presentations.

I'd envision applications like DPDK that are BDF centric to use the
first API with some '--allow-insecure-vfio' flag to switch on the
IOASID_IMPLICIT_GROUP. Maybe good applications would also print:
  "Danger Will Robinson these PCI BDFs [...] are also at risk"
when the switch is used, by parsing the sysfs.

> > Thus /dev/ioasid also becomes the unit of security and the IOMMU
> > subsystem level becomes aware of and enforces the group security
> > rules. Userspace does not need to "see" the group
> 
> What tools does userspace have to understand isolation of individual
> devices without groups?

I think we can continue to show all of this group information in sysfs
files, it just doesn't require application code to open a group FD.
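
As an illustration of getting the group from sysfs without opening a group
FD, here is a hedged C sketch that resolves a PCI device's iommu_group
symlink (on current kernels this lives at
/sys/bus/pci/devices/<BDF>/iommu_group and points at
.../kernel/iommu_groups/<N>). The function name and the sysfs_root
parameter are made up for the example; the parameter exists only so the
code can be pointed at a fake tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Return the iommu group number of PCI device `bdf` (e.g. "0000:01:00.0"),
 * or -1 if the device has no group link or it cannot be parsed.
 * `sysfs_root` is normally "/sys".
 */
static int sysfs_iommu_group(const char *sysfs_root, const char *bdf)
{
    char path[PATH_MAX], target[PATH_MAX];

    assert(sysfs_root != NULL && bdf != NULL);

    int len = snprintf(path, sizeof(path),
                       "%s/bus/pci/devices/%s/iommu_group", sysfs_root, bdf);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    ssize_t n = readlink(path, target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';

    /* The last path component of the link target is the group number. */
    const char *base = strrchr(target, '/');
    base = base ? base + 1 : target;

    char *end;
    long group = strtol(base, &end, 10);
    if (end == base || *end != '\0' || group < 0 || group > INT_MAX)
        return -1;
    return (int)group;
}
```

An application that takes BDFs from the user could call this for each
device and compare group numbers to find out which other devices share
the isolation boundary.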

This becomes relevant the more I think about it - eliminating the
group and container FD uAPI by directly creating the device FD also
sidesteps questions about how to model these objects in a /dev/ioasid
only world. We simply don't have them at all so the answer is pretty
easy.

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 17:57                                                                 ` Jason Gunthorpe
@ 2021-04-22 19:37                                                                   ` Alex Williamson
  2021-04-22 20:00                                                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Alex Williamson @ 2021-04-22 19:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Thu, 22 Apr 2021 14:57:15 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > The security rule for isolation is that once a device is attached to a
> > > /dev/ioasid fd then all other devices in that security group must be
> > > attached to the same ioasid FD or left unused.  
> > 
> > Sounds like a group...  Note also that if those other devices are not
> > isolated from the user's device, the user could manipulate "unused"
> > devices via DMA.  So even unused devices should be within the same
> > IOMMU context... thus attaching groups to IOMMU domains.  
> 
> That is a very interesting point. So, say, in the classic PCI bus
> world if I have a NIC and HD on my PCI bus and both are in the group,
> I assign the NIC to a /dev/ioasid & VFIO then it is possible to use
> the NIC to access the HD via DMA
> 
> And here you want a more explicit statement that the HD is at risk by
> using the NIC?

If by "classic" you mean conventional PCI bus, then this is much worse
than simply "at risk".  The IOMMU cannot differentiate devices behind a
PCIe-to-PCI bridge, so the moment you turn on the IOMMU context for the
NIC, the address space for your HBA is pulled out from under it.  In
the vfio world, the NIC and HBA are grouped and managed together; the
user cannot change the IOMMU context of a group unless all of the
devices in the group are "viable", i.e. they are released from any host
drivers.
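
A minimal sketch of the "viable" rule described above, walking the sysfs
layout (/sys/bus/pci/devices/<BDF>/iommu_group/devices and the per-device
"driver" link). The function names, the sysfs_root parameter and the default
allowed driver list are assumptions made for illustration; the kernel's own
check covers more cases than this.

```python
import os


def _dev_dir(bdf, sysfs_root):
    # Directory sysfs exposes for one PCI device, keyed by its BDF.
    return os.path.join(sysfs_root, "bus", "pci", "devices", bdf)


def group_devices(bdf, sysfs_root="/sys"):
    """List every BDF sharing an IOMMU group with bdf (bdf included)."""
    peers = os.path.join(_dev_dir(bdf, sysfs_root), "iommu_group", "devices")
    return sorted(os.listdir(peers))


def bound_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to bdf, or None if unbound."""
    link = os.path.join(_dev_dir(bdf, sysfs_root), "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def group_is_viable(bdf, sysfs_root="/sys", allowed=("vfio-pci",)):
    """Assumed rule: every peer is unbound or bound to an allowed driver."""
    for peer in group_devices(bdf, sysfs_root):
        driver = bound_driver(peer, sysfs_root)
        if driver is not None and driver not in allowed:
            return False
    return True
```

With the NIC bound to vfio-pci and the HBA still bound to a host driver,
group_is_viable() on the NIC's BDF returns False until the HBA is released.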

> Honestly, I'm not sure the current group FD is actually showing that
> very strongly - though I get the point it is modeled in the sysfs and
> kind of implicit in the API - we evolved things in a way where most
> actual applications are taking in a PCI BDF from the user, not a group
> reference. So the actual security impact seems lost on the user.

vfio users are extremely aware of grouping; they understand the model,
if not always the reason for the grouping.  You only need to look at
r/VFIO to find various lsgroup scripts and kernel patches to manipulate
grouping.  The visibility to the user is valuable imo.

> Along my sketch if we have:
> 
>    ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd)
>    [..]
>    ioctl(vfio_device_fd, ATTACH_IOASID, gpa_ioasid_id) == EPERM
> 
> I would feel comfortable if the ATTACH_IOASID fails by default if all
> devices in the group have not been joined to the same ioasidfd.

And without a group representation to userspace, how would a user know
to resolve that?

> So in the NIC&HD example the application would need to do:
> 
>    ioasid_fd = open("/dev/ioasid");
>    nic_device_fd = open("/dev/vfio/device0")
>    hd_device_fd = open("/dev/vfio/device1")
>    
>    ioctl(nic_device_fd, JOIN_IOASID_FD, ioasid_fd)
>    ioctl(hd_device_fd, JOIN_IOASID_FD, ioasid_fd)
>    [..]
>    ioctl(nic_device_fd, ATTACH_IOASID, gpa_ioasid_id) == SUCCESS
> 
> Now the security relation is forced by the kernel to be very explicit.

But not discoverable to the user.

> However to keep current semantics, I'd suggest a flag on
> JOIN_IOASID_FD called "IOASID_IMPLICIT_GROUP" which has the effect of
> allowing the ATTACH_IOASID to succeed without the user having to
> explicitly join all the group devices. This is analogous to the world
> we have today of opening the VFIO group FD but only instantiating one
> device FD.
> 
> In effect the ioasid FD becomes the group and the numbered IOASID's
> inside the FD become the /dev/vfio/vfio objects - we don't end up with
> fewer objects in the system, they just have different uAPI
> presentations.
> 
> I'd envision applications like DPDK that are BDF centric to use the
> first API with some '--allow-insecure-vfio' flag to switch on the
> IOASID_IMPLICIT_GROUP. Maybe good applications would also print:
>   "Danger Will Robinson these PCI BDFs [...] are also at risk"
> When the switch is used by parsing the sysfs

So the groups still exist in sysfs, they just don't have vfio
representations?  An implicit grouping does what, automatically unbind
the devices, so an admin gives a user access to the NIC but their HBA
device disappears because they were implicitly linked?  That's why vfio
bases ownership on the group: if a user owns the group but the group is
not viable because a device is still bound to another kernel driver,
the user can't do anything.  Implicitly snarfing up subtly affected
devices is bad.

> > > Thus /dev/ioasid also becomes the unit of security and the IOMMU
> > > subsystem level becomes aware of and enforces the group security
> > > rules. Userspace does not need to "see" the group  
> > 
> > What tools does userspace have to understand isolation of individual
> > devices without groups?  
> 
> I think we can continue to show all of this group information in sysfs
> files, it just doesn't require application code to open a group FD.
> 
> This becomes relavent the more I think about it - elmininating the
> group and container FD uAPI by directly creating the device FD also
> sidesteps questions about how to model these objects in a /dev/ioasid
> only world. We simply don't have them at all so the answer is pretty
> easy.

I'm not sold.  Ideally each device would be fully isolated, then we
could assume a 1:1 relation of group and device and collapse the model
to work on devices.  We don't live in that world and I see a benefit to
making that explicit in the uapi, even if that group fd might seem
superfluous at times.  Thanks,

Alex


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 19:37                                                                   ` Alex Williamson
@ 2021-04-22 20:00                                                                     ` Jason Gunthorpe
  2021-04-22 22:38                                                                       ` Alex Williamson
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-22 20:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Thu, Apr 22, 2021 at 01:37:47PM -0600, Alex Williamson wrote:

> If by "classic" you mean conventional PCI bus, then this is much worse
> than simply "at risk".  The IOMMU cannot differentiate devices behind a
> PCIe-to-PCI bridge, so the moment you turn on the IOMMU context for the
> NIC, the address space for your HBA is pulled out from under it.  

Yes, I understand this, but this is fine and not really surprising if
the HD device is just forced to remain "unused".

To my mind the bigger issue is the NIC now has access to the HD and
nobody really raised an alarm unless the HD happened to have a kernel
driver bound.

> the vfio world, the NIC and HBA are grouped and managed together, the
> user cannot change the IOMMU context of a group unless all of the
> devices in the group are "viable", ie. they are released from any host
> drivers.

Yes, I don't propose to change any of that, I just suggest turning
"change the IOMMU context" into "join a /dev/ioasid fd".

All devices in the group have to be joined to the same ioasid or, with
the flag, left "unused" with no kernel driver. 

This is the same viability test VFIO is doing now, just moved slightly
in the programming flow.
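
A toy model of the rule sketched above, assuming a hypothetical IoasidFd
class: ATTACH succeeds only if every device in the IOMMU group has joined the
same ioasid FD, or if the joining device set the implicit-group flag and every
unjoined peer has no host kernel driver. The class, its method names and the
errno choices are illustrative assumptions, not an existing kernel interface.

```python
import errno


class IoasidFd:
    """Toy stand-in for the proposed /dev/ioasid FD (hypothetical)."""

    def __init__(self, groups, host_bound=()):
        # groups: device name -> frozenset of all devices in its IOMMU group
        # host_bound: devices still bound to a host kernel driver
        self.groups = groups
        self.host_bound = set(host_bound)
        self.joined = set()
        self.implicit = set()

    def join(self, device, implicit_group=False):
        """Model of JOIN_IOASID_FD, optionally with IOASID_IMPLICIT_GROUP."""
        self.joined.add(device)
        if implicit_group:
            self.implicit.add(device)

    def attach(self, device):
        """Model of ATTACH_IOASID: 0 on success, negative errno on failure."""
        if device not in self.joined:
            return -errno.EINVAL
        unjoined = self.groups[device] - self.joined
        if not unjoined:
            return 0  # the whole group joined this FD explicitly
        if device in self.implicit and not (unjoined & self.host_bound):
            return 0  # remaining peers are left "unused", no kernel driver
        return -errno.EPERM
```

In the NIC and HD example, joining only the NIC makes attach() return -EPERM;
joining both devices, or joining the NIC with implicit_group=True while the HD
has no host driver, makes it return 0.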

> vfio users are extremely aware of grouping, they understand the model,
> if not always the reason for the grouping.  You only need to look at
> r/VFIO to find various lsgroup scripts and kernel patches to manipulate
> grouping.  The visibility to the user is valuable imo.

I don't propose to remove visibility, sysfs and the lsgroup scripts
should all still be there.

I'm just acknowledging the reality that the user command line experience
we have is focused on single BDFs not on groups. The user only sees
the group idea when things explode, so why do we have it as such an
integral part of the programming model?

> >    ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
> >    [..]
> >    ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid_id) == ENOPERM
> > 
> > I would feel comfortable if the ATTACH_IOASID fails by default if all
> > devices in the group have not been joined to the same ioasidfd.
> 
> And without a group representation to userspace, how would a user know
> to resolve that?

Userspace can continue to read sysfs files that show the group
relation.

I'm only talking about the group char device and FD.

> So the group still exist in sysfs, they just don't have vfio
> representations?  An implicit grouping does what, automatically unbind
> the devices, so an admin gives a user access to the NIC but their HBA
> device disappears because they were implicitly linked?  

It does exactly the same thing as opening the VFIO group FD and
instantiating a single device FD does today.

Most software, like dpdk, automatically deduces the VFIO group from
the commandline BDF, I'm mainly suggesting we move that deduction from
userspace software to kernel software.

> basis ownership on the group, if a user owns the group but the group is
> not viable because a device is still bound to another kernel driver,
> the use can't do anything.  Implicitly snarfing up subtly affected
> devices is bad.

The user would get a /dev/ioasid join failure just like they get a
failure from VFIO_GROUP_SET_CONTAINER (?) today, reflecting that the
group is not viable.

Otherwise what is the alternative?

How do we model the VFIO group security concept to something like
VDPA?

How do you reconcile the ioasid security model with the VFIO container
and group FDs?

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 20:00                                                                     ` Jason Gunthorpe
@ 2021-04-22 22:38                                                                       ` Alex Williamson
  2021-04-22 23:39                                                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: Alex Williamson @ 2021-04-22 22:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Thu, 22 Apr 2021 17:00:24 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 22, 2021 at 01:37:47PM -0600, Alex Williamson wrote:
> 
> > If by "classic" you mean conventional PCI bus, then this is much worse
> > than simply "at risk".  The IOMMU cannot differentiate devices behind a
> > PCIe-to-PCI bridge, so the moment you turn on the IOMMU context for the
> > NIC, the address space for your HBA is pulled out from under it.    
> 
> Yes, I understand this, but this is fine and not really surprising if
> the HD device is just forced to remain "unusued"
> 
> To my mind the bigger issue is the NIC now has access to the HD and
> nobody really raised an alarm unless the HD happened to have a kernel
> driver bound.
> 
> > the vfio world, the NIC and HBA are grouped and managed together, the
> > user cannot change the IOMMU context of a group unless all of the
> > devices in the group are "viable", ie. they are released from any host
> > drivers.  
> 
> Yes, I don't propose to change any of that, I just suggest to make the
> 'change the IOMMU context" into "join a /dev/ioasid fd"
> 
> All devices in the group have to be joined to the same ioasid or, with
> the flag, left "unused" with no kernel driver. 
> 
> This is the same viability test VFIO is doing now, just moved slightly
> in the programming flow.
> 
> > vfio users are extremely aware of grouping, they understand the model,
> > if not always the reason for the grouping.  You only need to look at
> > r/VFIO to find various lsgroup scripts and kernel patches to manipulate
> > grouping.  The visibility to the user is valuable imo.  
> 
> I don't propose to remove visibility, sysfs and the lsgroup scripts
> should all still be there.
> 
> I'm just acknowledging reality that the user command line experiance
> we have is focused on single BDFs not on groups. The user only sees
> the group idea when things explode, so why do we have it as such an
> integral part of the programming model?

Because it's fundamental to the isolation of the device?  What you're
proposing doesn't get around the group issue, it just makes it implicit
rather than explicit in the uapi.  For what?  Some ideal notion that
every device should be isolated at the expense of userspace drivers
that then fail randomly because they didn't take into account groups
because it's not part of the uapi?

> > >    ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
> > >    [..]
> > >    ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid_id) == ENOPERM
> > > 
> > > I would feel comfortable if the ATTACH_IOASID fails by default if all
> > > devices in the group have not been joined to the same ioasidfd.  
> > 
> > And without a group representation to userspace, how would a user know
> > to resolve that?  
> 
> Userspace can continue to read sysfs files that show the group
> relation.
> 
> I'm only talking about the group char device and FD.
> 
> > So the group still exist in sysfs, they just don't have vfio
> > representations?  An implicit grouping does what, automatically unbind
> > the devices, so an admin gives a user access to the NIC but their HBA
> > device disappears because they were implicitly linked?    
> 
> It does exactly the same thing as opening the VFIO group FD and
> instantiating a single device FD does today.
> 
> Most software, like dpdk, automatically deduces the VFIO group from
> the commandline BDF, I'm mainly suggesting we move that deduction from
> userspace software to kernel software.
> 
> > basis ownership on the group, if a user owns the group but the group is
> > not viable because a device is still bound to another kernel driver,
> > the use can't do anything.  Implicitly snarfing up subtly affected
> > devices is bad.  
> 
> The user would get an /dev/ioasid join failure just like they get a
> failure from VFIO_GROUP_SET_CONTAINER (?) today that reflects the
> group is not viable.
> 
> Otherwise what is the alternative?
> 
> How do we model the VFIO group security concept to something like
> VDPA?

Is it really a "VFIO group security concept"?  We're reflecting the
reality of the hardware, not all devices are fully isolated.  An IOMMU
group is the smallest set of devices we believe are isolated from all
other sets of devices.  VFIO groups simply reflect that notion of an
IOMMU group.  This is the reality that any userspace driver needs to
play in, it doesn't magically go away because we drop the group file
descriptor.  It only makes the uapi more difficult to use correctly
because userspace drivers need to go outside of the uapi to have any
idea that this restriction exists.  Thanks,

Alex


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 22:38                                                                       ` Alex Williamson
@ 2021-04-22 23:39                                                                         ` Jason Gunthorpe
  2021-04-23 10:31                                                                           ` Tian, Kevin
                                                                                             ` (2 more replies)
  0 siblings, 3 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-22 23:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:

> Because it's fundamental to the isolation of the device?  What you're
> proposing doesn't get around the group issue, it just makes it implicit
> rather than explicit in the uapi.

I'm not even sure it makes it explicit or implicit, it just takes away
the FD.

There are four group IOCTLs, I see them mapping to /dev/ioasid as follows:
 VFIO_GROUP_GET_STATUS -
   + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
   + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
     kernel/iommu_groups, or could be an IOCTL on /dev/ioasid
       IOASID_ALL_DEVICES_VIABLE

 VFIO_GROUP_SET_CONTAINER -
   + This happens implicitly when the device joins the IOASID
     so it gets moved to the vfio_device FD:
      ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasidfd)

 VFIO_GROUP_UNSET_CONTAINER -
   + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD

 VFIO_GROUP_GET_DEVICE_FD -
   + Replaced by opening /dev/vfio/deviceX
     Learn the deviceX name from the cdev that sysfs shows as:
      /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
     then open /dev/vfio/deviceX

> > How do we model the VFIO group security concept to something like
> > VDPA?
> 
> Is it really a "VFIO group security concept"?  We're reflecting the
> reality of the hardware, not all devices are fully isolated.  

Well, exactly.

/dev/ioasid should understand the group concept somehow, otherwise it
is incomplete and maybe even security broken.

So, how do I add groups to, say, VDPA in a way that makes sense? The
only answer I come to is broadly what I outlined here - make
/dev/ioasid do all the group operations, and do them when we join
the VDPA device to the ioasid.

Once I have solved all the groups problems with the non-VFIO users,
then where does that leave VFIO? Why does VFIO need a group FD if
everyone else doesn't?

> IOMMU group.  This is the reality that any userspace driver needs to
> play in, it doesn't magically go away because we drop the group file
> descriptor.  

I'm not saying it does, I'm saying it makes the uAPI more regular and
easier to fit into /dev/ioasid without the group FD.

> It only makes the uapi more difficult to use correctly because
> userspace drivers need to go outside of the uapi to have any idea
> that this restriction exists.  

I don't think it makes any substantive difference one way or the
other.

With the group FD: the userspace has to read sysfs, find the list of
devices in the group, open the group fd, create device FDs for each
device using the name from sysfs.

Starting from a BDF the general pseudo code is
 group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
 group_name = basename(group_path)
 group_fd = open("/dev/vfio/"+group_name)
 device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, BDF);

Without the group FD: the userspace has to read sysfs, find the list
of devices in the group and then open the device-specific cdev (found
via sysfs) and link them to a /dev/ioasid FD.

Starting from a BDF the general pseudo code is:
 device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
 device_fd = open("/dev/vfio/"+device_name)
 ioasidfd = open("/dev/ioasid")
 ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)

These two routes can have identical outcomes and identical security
checks.

In both cases if userspace wants a list of BDFs in the same group as
the BDF it is interested in:
   readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")

It seems like a very small difference to me.
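
The two lookups above can be sketched side by side. This is only an
illustration: the helper names and the sysfs_root/dev_root parameters are made
up here, and the per-device "vfio/deviceX" sysfs directory is the layout
proposed in this thread, not something assumed to exist on a running kernel.

```python
import os


def group_cdev_path(bdf, sysfs_root="/sys", dev_root="/dev"):
    """Group FD route: BDF -> iommu_group link -> /dev/vfio/<group>."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "iommu_group")
    group_name = os.path.basename(os.readlink(link))
    return os.path.join(dev_root, "vfio", group_name)


def device_cdev_path(bdf, sysfs_root="/sys", dev_root="/dev"):
    """Proposed route: BDF -> first directory under vfio/ -> /dev/vfio/<name>."""
    vfio_dir = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "vfio")
    names = sorted(e.name for e in os.scandir(vfio_dir) if e.is_dir())
    if not names:
        raise FileNotFoundError("no vfio device directory for " + bdf)
    return os.path.join(dev_root, "vfio", names[0])
```

Both helpers start from the same BDF and do one sysfs lookup each; only the
node that ends up being opened differs.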

I still don't see how the group restriction gets surfaced to the
application through the group FD. The applications I looked through
just treat the group FD as a step on their way to get the device_fd.

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 12:10                                                                 ` Jason Gunthorpe
@ 2021-04-23  9:06                                                                   ` Tian, Kevin
  2021-04-23 11:49                                                                     ` Jason Gunthorpe
  2021-04-29  8:55                                                                   ` Auger Eric
  2021-04-29 13:26                                                                   ` Auger Eric
  2 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-23  9:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 22, 2021 8:10 PM
> 
> On Thu, Apr 22, 2021 at 08:34:32AM +0000, Tian, Kevin wrote:
> 
> > Another tricky thing is that a container may be linked to multiple iommu
> > domains in VFIO, as devices in the container may locate behind different
> > IOMMUs with inconsistent capability (commit 1ef3e2bc).
> 
> Frankly this sounds over complicated. I would think /dev/ioasid should
> select the IOMMU when the first device is joined, and all future joins
> must be compatible with the original IOMMU - ie there is only one set
> of IOMMU capabilities in a /dev/ioasid.

Or could we still have just one /dev/ioasid but allow userspace to create
multiple gpa_ioasid_id's, each associated with a different iommu domain?
Then the compatibility check would be done at ATTACH_IOASID instead of
JOIN_IOASID_FD.

This does impose one burden on userspace though: it has to understand the
IOMMU compatibilities, figure out which incompatible features may
affect the page table management (such knowledge is IOMMU vendor
specific), and then explicitly manage multiple /dev/ioasid's or
multiple gpa_ioasid_id's.

Alternatively, would it be a good design to have the kernel return an
error at attach/join time when an incompatibility is detected, so that
userspace opens a new /dev/ioasid or creates a new gpa_ioasid_id for
the failing device upon such a failure, w/o constructing its own
compatibility knowledge?
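
A rough sketch of that last alternative, with userspace holding no
compatibility knowledge of its own: it tries each open /dev/ioasid in turn and
opens a new one only when every join is rejected. The place_devices helper and
the compatible callback (standing in for the kernel's accept/reject answer at
join time) are hypothetical names used only for this illustration.

```python
def place_devices(devices, compatible):
    """Greedily place devices into as few ioasid FDs as possible.

    devices: list of (name, iommu_caps) pairs.
    compatible(first_caps, caps): stands in for the kernel accepting a join
    into an FD whose IOMMU was selected by the first joined device.
    Returns a list of FDs, each a list of device names.
    """
    fds = []  # each entry: list of (name, caps); entry[0] fixed the IOMMU
    for name, caps in devices:
        for members in fds:
            if compatible(members[0][1], caps):
                members.append((name, caps))  # join accepted by this FD
                break
        else:
            fds.append([(name, caps)])  # every join failed: open a new FD
    return [[name for name, _ in members] for members in fds]
```

For example, with an igd behind an IOMMU lacking IOMMU_CACHE and two other
devices behind one that supports it, the igd ends up alone in a second FD.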

> 
> This means qemue might have multiple /dev/ioasid's if the system has
> multiple incompatible IOMMUs (is this actually a thing?) The platform

One example is an Intel platform with igd. Typically there is one IOMMU
dedicated to the igd and another IOMMU serving all the remaining devices.
The igd IOMMU may not support IOMMU_CACHE while the other one
does.

> should design its IOMMU domains to minimize the number of
> /dev/ioasid's required.
> 
> Is there a reason we need to share IOASID'd between completely
> divergance IOMMU implementations? I don't expect the HW should be able
> to physically share page tables??

yes, e.g. in vSVA both devices (behind divergent IOMMUs) are bound
to a single guest process which has a unique PASID and 1st-level page
table. The earlier incompatibility example is only for the 2nd level.

> 
> That decision point alone might be the thing that just says we can't
> ever have /dev/vfio/vfio == /dev/ioasid

yes, unless we adopt the vfio scheme, i.e. implicitly managing incompatible 
iommu domains in /dev/ioasid.

> 
> > Just to confirm. Above flow is for current map/unmap flavor as what
> > VFIO/vDPA do today. Later when nested translation is supported,
> > there is no need to detach gpa_ioasid_fd. Instead, a new cmd will
> > be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd:
> 
> Sure.. The tricky bit will be to define both of the common nested
> operating modes.
> 
>   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
>   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
> 
>    // IOMMU will match on the device RID, no PASID:
>   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> 
>    // IOMMU will match on the device RID and PASID:
>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);

I'm a bit confused here about why we have both pasid and ioasid notations together.
Why not use nested_ioasid as pasid directly (i.e. every pasid in nested mode
is created by CREATE_NESTED_IOASID)?

Below I list different scenarios for ATTACH_IOASID in my view. Here 
vfio_device could be a real PCI function (RID), or a subfunction device 
(RID+def_ioasid). The vfio_device could be attached to a gpa_ioasid 
(RID in guest view, no nesting), a nested_ioasid (RID in guest view, nesting) 
or a nested_ioasid (RID+PASID in guest view, nesting). 

// IOMMU will match on the device RID, no nesting, no PASID:
ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid);

// IOMMU will match on the device (RID+def_ioasid), no nesting, no PASID:
ioctl(vfio_subdevice, ATTACH_IOASID, gpa_ioasid);

// IOMMU will match on the device RID, nesting, no PASID:
ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);

// IOMMU will match on the device (RID+def_ioasid), nesting, no PASID:
ioctl(vfio_subdevice, ATTACH_IOASID, nested_ioasid);

// IOMMU will match on the device (RID+nested_ioasid), nesting, PASID:
ioctl(vfio_device, ATTACH_IOASID_PASID, nested_ioasid);

// IOMMU will match on the device (RID+nested_ioasid), nesting, PASID:
ioctl(vfio_subdevice, ATTACH_IOASID_PASID, nested_ioasid);
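
The six cases listed above can be condensed into one small lookup. This is
only a restatement of the list as code; iommu_match and its parameter names
are hypothetical, and the returned strings simply echo the comments above.

```python
def iommu_match(is_subdevice, ioasid_kind, with_pasid):
    """Describe what the IOMMU matches on for one attach scenario.

    is_subdevice: True for a subfunction device (RID+def_ioasid).
    ioasid_kind:  "gpa" or "nested".
    with_pasid:   True for ATTACH_IOASID_PASID, False for ATTACH_IOASID.
    """
    if ioasid_kind not in ("gpa", "nested"):
        raise ValueError("unknown ioasid kind: " + ioasid_kind)
    if with_pasid:
        if ioasid_kind != "nested":
            # The list only shows ATTACH_IOASID_PASID with a nested ioasid.
            raise ValueError("PASID attach is only listed for nested ioasids")
        return {"cmd": "ATTACH_IOASID_PASID", "match": "RID+nested_ioasid",
                "nesting": True, "pasid": True}
    match = "RID+def_ioasid" if is_subdevice else "RID"
    return {"cmd": "ATTACH_IOASID", "match": match,
            "nesting": ioasid_kind == "nested", "pasid": False}
```

Note that in the PASID cases the match key no longer depends on whether the
vfio_device is a full function or a subfunction device.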

> 
> Notice that ATTACH (or bind, whatever) is always done on the
> vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF&PASID to
> a specific page table defined by an IOASID.
> 
> I expect we have many flavours of IOASID tables, eg we have normal,
> and 'nested with table controlled by hypervisor'. ARM has 'nested with
> table controlled by guest' right? So like this?
> 
>   nested_ioasid = ioctl(ioasid_fd, CREATE_DELGATED_IOASID,
>                    gpa_ioasid_id, <some kind of viommu_id>)
>   // PASID now goes to <viommu_id>
>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> 
> Where <viommu_id> is some internal to the guest handle of the viommu
> page table scoped within gpa_ioasid_id? Like maybe it is GPA of the
> base of the page table?
> 
> The guest can't select its own PASIDs without telling the hypervisor,
> right?
> 

If the whole PASID table is delegated to the guest, as in the ARM case,
the guest can select its own PASIDs w/o telling the hypervisor. I haven't
thought carefully about a clean way to support this scheme, e.g. whether
mandating that the guest always allocate PASIDs through the hypervisor
even in this case would make the uAPI simpler...

Thanks
Kevin

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 23:39                                                                         ` Jason Gunthorpe
@ 2021-04-23 10:31                                                                           ` Tian, Kevin
  2021-04-23 11:57                                                                             ` Jason Gunthorpe
  2021-04-27  5:11                                                                             ` David Gibson
  2021-04-23 16:38                                                                           ` Alex Williamson
  2021-04-27  5:08                                                                           ` David Gibson
  2 siblings, 2 replies; 269+ messages in thread
From: Tian, Kevin @ 2021-04-23 10:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, LKML,
	Joerg Roedel, Lu Baolu, David Woodhouse, iommu, cgroups,
	Tejun Heo, Li Zefan, Johannes Weiner, Jean-Philippe Brucker,
	Jonathan Corbet, Raj, Ashok, Wu, Hao, Jiang, Dave, David Gibson,
	Alexey Kardashevskiy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 23, 2021 7:40 AM
> 
> On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
> 
> > Because it's fundamental to the isolation of the device?  What you're
> > proposing doesn't get around the group issue, it just makes it implicit
> > rather than explicit in the uapi.
> 
> I'm not even sure it makes it explicit or implicit, it just takes away
> the FD.
> 
> There are four group IOCTLs, I see them mapping to /dev/ioasid follows:
>  VFIO_GROUP_GET_STATUS -
>    + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
>    + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
>      kernel/iomm_groups, or could be an IOCTL on /dev/ioasid
>        IOASID_ALL_DEVICES_VIABLE
> 
>  VFIO_GROUP_SET_CONTAINER -
>    + This happens implicitly when the device joins the IOASID
>      so it gets moved to the vfio_device FD:
>       ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
> 
>  VFIO_GROUP_UNSET_CONTAINER -
>    + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
> 
>  VFIO_GROUP_GET_DEVICE_FD -
>    + Replaced by opening /dev/vfio/deviceX
>      Learn the deviceX which will be the cdev sysfs shows as:
>       /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
>     Open /dev/vfio/deviceX
> 
> > > How do we model the VFIO group security concept to something like
> > > VDPA?
> >
> > Is it really a "VFIO group security concept"?  We're reflecting the
> > reality of the hardware, not all devices are fully isolated.
> 
> Well, exactly.
> 
> /dev/ioasid should understand the group concept somehow, otherwise it
> is incomplete and maybe even security broken.
> 
> So, how do I add groups to, say, VDPA in a way that makes sense? The
> only answer I come to is broadly what I outlined here - make
> /dev/ioasid do all the group operations, and do them when we enjoin
> the VDPA device to the ioasid.
> 
> Once I have solved all the groups problems with the non-VFIO users,
> then where does that leave VFIO? Why does VFIO need a group FD if
> everyone else doesn't?
> 
> > IOMMU group.  This is the reality that any userspace driver needs to
> > play in, it doesn't magically go away because we drop the group file
> > descriptor.
> 
> I'm not saying it does, I'm saying it makes the uAPI more regular and
> easier to fit into /dev/ioasid without the group FD.
> 
> > It only makes the uapi more difficult to use correctly because
> > userspace drivers need to go outside of the uapi to have any idea
> > that this restriction exists.
> 
> I don't think it makes any substantive difference one way or the
> other.
> 
> With the group FD: the userspace has to read sysfs, find the list of
> devices in the group, open the group fd, create device FDs for each
> device using the name from sysfs.
> 
> Starting from a BDF the general pseudo code is
>  group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
>  group_name = basename(group_path)
>  group_fd = open("/dev/vfio/"+group_name)
>  device_fd = ioctl(VFIO_GROUP_GET_DEVICE_FD, BDF);
> 
> Without the group FD: the userspace has to read sysfs, find the list
> of devices in the group and then open the device-specific cdev (found
> via sysfs) and link them to a /dev/ioasid FD.
> 
> Starting from a BDF the general pseudo code is:
>  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
>  device_fd = open("/dev/vfio/"+device_name)
>  ioasidfd = open("/dev/ioasid")
>  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> 
> These two routes can have identical outcomes and identical security
> checks.
> 
> In both cases if userspace wants a list of BDFs in the same group as
> the BDF it is interested in:
>    readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")
> 
> It seems like a very small difference to me.
> 
> I still don't see how the group restriction gets surfaced to the
> application through the group FD. The applications I looked through
> just treat the group FD as a step on their way to get the device_fd.
> 

So your proposal sort of moves the entire container/group/domain
management into /dev/ioasid and leaves vfio to provide only the
device-specific uAPI. An ioasid represents a page table (address space),
and thus is equivalent to the scope of a VFIO container. Having the device
join an ioasid is equivalent to attaching a device to a VFIO container, and
here the group integrity must be enforced. /dev/ioasid then needs to
manage group objects and their association with the ioasid and the
underlying iommu domain anyway, so it's pointless to keep the same logic
within VFIO. Is this understanding correct?

btw one remaining open question is whether you expect /dev/ioasid to be
associated with a single iommu domain, or multiple. If only a single
domain is allowed, the ioasid_fd is equivalent to the scope of a VFIO
container. It is supposed to have only one gpa_ioasid_id since one
iommu domain can only have a single 2nd level pgtable. Then all other
ioasids, once allocated, must be nested on this gpa_ioasid_id to fit
in the same domain. If a legacy vIOMMU is exposed (which disallows
nesting), the userspace has to open an ioasid_fd for every group.
This is basically the VFIO way. On the other hand, if multiple domains
are allowed, there could be multiple ioasid_ids, each holding a 2nd level
pgtable and an iommu domain (or a list of pgtables and domains due to
the incompatibility issue discussed in another thread), and each can be
nested by other ioasids respectively. The application only needs
to open /dev/ioasid once regardless of whether the vIOMMU allows
nesting, and has a single interface for ioasid allocation. Which way
do you prefer?

Thanks
Kevin

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-23  9:06                                                                   ` Tian, Kevin
@ 2021-04-23 11:49                                                                     ` Jason Gunthorpe
  2021-04-25  9:24                                                                       ` Tian, Kevin
  2021-04-29  8:54                                                                       ` Auger Eric
  0 siblings, 2 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-23 11:49 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave

On Fri, Apr 23, 2021 at 09:06:44AM +0000, Tian, Kevin wrote:

> Or could we still have just one /dev/ioasid but allow userspace to create
> multiple gpa_ioasid_id's each associated to a different iommu domain? 
> Then the compatibility check will be done at ATTACH_IOASID instead of 
> JOIN_IOASID_FD.

To my mind what makes sense is that /dev/ioasid presents a single
IOMMU behavior that is basically the same. This may ultimately not be
what we call a domain today.

We may end up with a middle object which is a group of domains that
all have the same capabilities, and we define capabilities in a way
that most platforms have a single group of domains.

The key capability of a group of domains is they can all share the HW
page table representation, so if an IOASID instantiates a page table
it can be assigned to any device on any domain in the group of domains.

If you try to say that /dev/ioasid has many domains and they can't
have their HW page tables shared then I think the implementation
complexity will explode.

> This does impose one burden to userspace though, to understand the 
> IOMMU compatibilities and figure out which incompatible features may
> affect the page table management (while such knowledge is IOMMU
> vendor specific) and then explicitly manage multiple /dev/ioasid's or 
> multiple gpa_ioasid_id's.

Right, this seems very hard in the general case..
 
> Alternatively is it a good design by having the kernel return error at
> attach/join time to indicate that incompatibility is detected then the 
> userspace should open a new /dev/ioasid or creates a new gpa_ioasid_id
> for the failing device upon such failure, w/o constructing its own 
> compatibility knowledge?

Yes, this feels workable too
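
A minimal sketch of that retry-on-failure flow, written in Python for
brevity. Nothing here is a real uAPI: Incompatible, place_devices,
make_fake_kernel and the "format" strings are hypothetical stand-ins for
whatever error and join/attach call /dev/ioasid ends up defining. The only
point is that userspace carries no compatibility knowledge of its own and
simply reacts to the kernel's refusal.

```python
class Incompatible(Exception):
    """Hypothetical stand-in for the error returned at join/attach time."""


def place_devices(devices, new_ioasid_fd, join):
    """Join each device to the first ioasid fd that accepts it.

    new_ioasid_fd() models open("/dev/ioasid"); join(dev, fd) models the
    proposed JOIN_IOASID_FD and raises Incompatible on a mismatch.
    Returns a {device: fd} dict.
    """
    fds, placement = [], {}
    for dev in devices:
        for fd in fds:
            try:
                join(dev, fd)
            except Incompatible:
                continue            # try the next existing fd
            placement[dev] = fd
            break
        else:
            fd = new_ioasid_fd()    # nothing compatible: open a fresh one
            join(dev, fd)           # assumed to succeed on an empty fd
            fds.append(fd)
            placement[dev] = fd
    return placement


def make_fake_kernel(device_format):
    """Toy kernel: an fd adopts the page table format of its first device."""
    fd_format = {}

    def new_ioasid_fd():
        fd = len(fd_format)
        fd_format[fd] = None
        return fd

    def join(dev, fd):
        if fd_format[fd] is None:
            fd_format[fd] = device_format[dev]
        elif fd_format[fd] != device_format[dev]:
            raise Incompatible(dev)

    return new_ioasid_fd, join
```

With two devices behind one IOMMU flavour and a third behind another, the
loop ends up with two fds without ever inspecting vendor-specific
capabilities.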

> > This means qemu might have multiple /dev/ioasid's if the system has
> > multiple incompatible IOMMUs (is this actually a thing?) The platform
> 
> One example is Intel platform with igd. Typically there is one IOMMU
> dedicated for igd and the other IOMMU serving all the remaining devices.
> The igd IOMMU may not support IOMMU_CACHE while the other one
> does.

If we can do as above the two domains may be in the same group of
domains and the IOMMU_CACHE is not exposed at the /dev/ioasid level.

For instance the API could specify IOMMU_CACHE during attach, not
during IOASID creation.

Getting all the data model right in the API is going to be the trickiest
part of this.

> yes, e.g. in vSVA both devices (behind divergence IOMMUs) are bound
> to a single guest process which has an unique PASID and 1st-level page
> table. Earlier incompatibility example is only for 2nd-level.

Because when we get to here, things become inscrutable as an API if
you are trying to say two different IOMMU presentations can actually
be nested.

> > Sure.. The tricky bit will be to define both of the common nested
> > operating modes.
> > 
> >   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
> >   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
> > 
> >    // IOMMU will match on the device RID, no PASID:
> >   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> > 
> >    // IOMMU will match on the device RID and PASID:
> >   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> 
> I'm a bit confused here why we have both pasid and ioasid notations together.
> Why not use nested_ioasid as pasid directly (i.e. every pasid in nested mode
> is created by CREATE_NESTED_IOASID)?

The IOASID is not a PASID, it is just a page table.

A generic IOMMU matches on either RID or (RID,PASID), so you should
specify the PASID when establishing the match.

IOASID only specifies the page table.

So you read the above as configuring the path

  PCI_DEVICE -> (RID,PASID) -> nested_ioasid -> gpa_ioasid_id -> physical

Where (RID,PASID) indicate values taken from the PCI packet.

In principle the IOMMU could also be commanded to reuse the same
ioasid page table with a different PASID:

  PCI_DEVICE_B -> (RID_B,PASID_B) -> nested_ioasid -> gpa_ioasid_id -> physical

This is impossible if the ioasid == PASID in the API.
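
The separation can be sketched as a small lookup model. This is purely
illustrative Python, not kernel code: the Ioasid and Iommu classes, the
attach() and translate_path() helpers, and the RID/PASID values are all
made up for the example. It shows the match key (RID, PASID) living in the
IOMMU, while the IOASID is only the page table that the match points at.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ioasid:
    """A page table. `parent` is the IOASID it is nested on (None: physical)."""
    name: str
    parent: Optional["Ioasid"] = None


@dataclass
class Iommu:
    # Match table: (RID, PASID) -> IOASID. PASID None means "match on RID only".
    matches: Dict[Tuple[int, Optional[int]], Ioasid] = field(default_factory=dict)

    def attach(self, rid: int, ioasid: Ioasid, pasid: Optional[int] = None) -> None:
        self.matches[(rid, pasid)] = ioasid

    def translate_path(self, rid: int, pasid: Optional[int] = None) -> List[str]:
        """Names of the page tables a DMA tagged (rid, pasid) would walk."""
        ioasid = self.matches[(rid, pasid)]
        path = []
        while ioasid is not None:
            path.append(ioasid.name)
            ioasid = ioasid.parent
        return path


gpa = Ioasid("gpa_ioasid_id")
nested = Ioasid("nested_ioasid", parent=gpa)

iommu = Iommu()
iommu.attach(0x0100, nested, pasid=5)   # PCI_DEVICE   -> (RID, PASID)
iommu.attach(0x0200, nested, pasid=9)   # PCI_DEVICE_B -> (RID_B, PASID_B)
```

Both entries resolve to the same nested_ioasid object even though the
PASIDs differ, which could not be expressed if the IOASID number had to
equal the PASID.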

> Below I list different scenarios for ATTACH_IOASID in my view. Here 
> vfio_device could be a real PCI function (RID), or a subfunction device 
> (RID+def_ioasid). 

What is RID+def_ioasid? The IOMMU does not match on IOASID's.

A subfunction device always needs to use a PASID, or an internal IOMMU.
I'm confused about what you are trying to explain.

> If the whole PASID table is delegated to the guest in ARM case, the guest
> can select its own PASIDs w/o telling the hypervisor. 

The hypervisor has to route the PASID's to the guest at some point - a
guest can't just claim a PASID unilaterally, that would not be secure.

If it is not done with per-PASID hypercalls then the hypervisor has to
route all PASID's for a RID to the guest and /dev/ioasid needs to have
a nested IOASID object that represents this connection - ie it points
to the PASID table of the guest vIOMMU or something.

Remember this all has to be compatible with mdev's too and without
hypercalls to create PASIDs that will be hard: mdev sharing a RID and
slicing the physical PASIDs can't support a 'send all PASIDs to the
guest' model, or even a 'the guest gets to pick the PASID' option.

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-23 10:31                                                                           ` Tian, Kevin
@ 2021-04-23 11:57                                                                             ` Jason Gunthorpe
  2021-04-27  5:11                                                                             ` David Gibson
  1 sibling, 0 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-23 11:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Fri, Apr 23, 2021 at 10:31:46AM +0000, Tian, Kevin wrote:

> So your proposal sort of moves the entire container/group/domain
> management into /dev/ioasid and then leaves vfio to provide only device
> specific uAPI. An ioasid represents a page table (address space), thus 
> is equivalent to the scope of VFIO container. Having the device join 
> an ioasid is equivalent to attaching a device to VFIO container, and 
> here the group integrity must be enforced. Then /dev/ioasid anyway 
> needs to manage group objects and their association with ioasid and 
> underlying iommu domain thus it's pointless to keep same logic within
> VFIO. Is this understanding correct?

Yes, I haven't thought of a way to define /dev/ioasid in a way that is
useful to VDPA/etc without all these parts.. If you come up with a
better idea do share.

> btw one remaining open is whether you expect /dev/ioasid to be 
> associated with a single iommu domain, or multiple. If only a single
> domain is allowed, the ioasid_fd is equivalent to the scope of VFIO
> container. 

See the prior email for a more complete set of thoughts on this.

> It is supposed to have only one gpa_ioasid_id since one iommu domain
> can only have a single 2nd level pgtable. Then all other ioasids,
> once allocated, must be nested on this gpa_ioasid_id to fit in the
> same domain. If a legacy vIOMMU is exposed (which disallows
> nesting), the userspace has to open an ioasid_fd for every group.
> This is basically the VFIO way. On the other hand if multiple
> domains are allowed, there could be multiple ioasid_ids each holding
> a 2nd level pgtable and an iommu domain (or a list of pgtables and
> domains due to incompatibility issue as discussed in another
> thread), and can be nested by other ioasids respectively. The
> application only needs to open /dev/ioasid once regardless of
> whether vIOMMU allows nesting, and has a single interface for ioasid
> allocation. Which way do you prefer?

I have a feeling we want to have a single IOASID be usable in as many
contexts as possible - as many domains, devices and groups as we can
get away with.

The IOASID is the expensive object, it is the pagetable, it is
potentially a lot of memory. The API should be designed so we don't
have to have multiple copies of the same pagetable.

For this reason I think having multiple IOASID's in a single
/dev/ioasid container is the way to go.

To my mind the /dev/ioasid should be linked to a HW page table format
and any device/domain/group that uses that same HW page table format
can be joined to it. This implies we can have multiple domains under
/dev/ioasid, but there is a limitation on what domains can be grouped
together.

This probably does not match the exact IOMMU capability/domain model
we have today, so I present it as an inspirational goal. The other
tricky thing here will be to define small steps.. 

eg V1 may only allow one domain, but the uAPI will not reflect this as
we expect V2 will allow multiple domains..

Jason

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 23:39                                                                         ` Jason Gunthorpe
  2021-04-23 10:31                                                                           ` Tian, Kevin
@ 2021-04-23 16:38                                                                           ` Alex Williamson
  2021-04-23 22:28                                                                             ` Jason Gunthorpe
  2021-04-27  5:08                                                                           ` David Gibson
  2 siblings, 1 reply; 269+ messages in thread
From: Alex Williamson @ 2021-04-23 16:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Thu, 22 Apr 2021 20:39:50 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
> 
> > Because it's fundamental to the isolation of the device?  What you're
> > proposing doesn't get around the group issue, it just makes it implicit
> > rather than explicit in the uapi.  
> 
> I'm not even sure it makes it explicit or implicit, it just takes away
> the FD.
> 
> There are four group IOCTLs, I see them mapping to /dev/ioasid as follows:
>  VFIO_GROUP_GET_STATUS - 
>    + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
>    + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
>      kernel/iommu_groups, or could be an IOCTL on /dev/ioasid
>        IOASID_ALL_DEVICES_VIABLE
> 
>  VFIO_GROUP_SET_CONTAINER -
>    + This happens implicitly when the device joins the IOASID
>      so it gets moved to the vfio_device FD:
>       ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasidfd)
> 
>  VFIO_GROUP_UNSET_CONTAINER -
>    + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
> 
>  VFIO_GROUP_GET_DEVICE_FD -
>    + Replaced by opening /dev/vfio/deviceX
>      Learn the deviceX which will be the cdev sysfs shows as:
>       /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
>     Open /dev/vfio/deviceX
> 
> > > How do we model the VFIO group security concept to something like
> > > VDPA?  
> > 
> > Is it really a "VFIO group security concept"?  We're reflecting the
> > reality of the hardware, not all devices are fully isolated.    
> 
> Well, exactly.
> 
> /dev/ioasid should understand the group concept somehow, otherwise it
> is incomplete and maybe even security broken.
> 
> So, how do I add groups to, say, VDPA in a way that makes sense? The
> only answer I come to is broadly what I outlined here - make
> /dev/ioasid do all the group operations, and do them when we enjoin
> the VDPA device to the ioasid.
> 
> Once I have solved all the groups problems with the non-VFIO users,
> then where does that leave VFIO? Why does VFIO need a group FD if
> everyone else doesn't?

This assumes there's a solution for vDPA that doesn't just ignore the
problem and hope for the best.  I can't speak to a vDPA solution.

> > IOMMU group.  This is the reality that any userspace driver needs to
> > play in, it doesn't magically go away because we drop the group file
> > descriptor.    
> 
> I'm not saying it does, I'm saying it makes the uAPI more regular and
> easier to fit into /dev/ioasid without the group FD.
> 
> > It only makes the uapi more difficult to use correctly because
> > userspace drivers need to go outside of the uapi to have any idea
> > that this restriction exists.    
> 
> I don't think it makes any substantive difference one way or the
> other.
> 
> With the group FD: the userspace has to read sysfs, find the list of
> devices in the group, open the group fd, create device FDs for each
> device using the name from sysfs.
> 
> Starting from a BDF the general pseudo code is
>  group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
>  group_name = basename(group_path)
>  group_fd = open("/dev/vfio/"+group_name)
>  device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, BDF);
> 
> Without the group FD: the userspace has to read sysfs, find the list
> of devices in the group and then open the device-specific cdev (found
> via sysfs) and link them to a /dev/ioasid FD.
> 
> Starting from a BDF the general pseudo code is:
>  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
>  device_fd = open("/dev/vfio/"+device_name)
>  ioasidfd = open("/dev/ioasid")
>  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)

This is exactly the implicit vs explicit semantics.  In the existing
vfio case, the user needs to explicitly interact with the group.  In
your proposal, the user interacts with the device, the group concept is
an implicit restriction.  You've listed a step in the description about
a "list of devices in the group", but nothing in the pseudo code
reflects that step.  I expect it would be subtly missed by any
userspace driver developer unless they happen to work on a system where
the grouping is not ideal.  I think that makes the interface hard to
use correctly.
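
For reference, the "list of devices in the group" step could look roughly
like the sketch below (Python for brevity). It relies only on the sysfs
layout already cited in this thread, /sys/bus/pci/devices/BDF/iommu_group.
The helper names and the `sysfs` parameter are made up for the example;
the parameter exists only so the code can be pointed at a test tree.

```python
import os


def iommu_group_of(bdf, sysfs="/sys"):
    """Return the IOMMU group name (e.g. "7") that a PCI device belongs to."""
    link = os.path.join(sysfs, "bus", "pci", "devices", bdf, "iommu_group")
    return os.path.basename(os.readlink(link))


def group_members(bdf, sysfs="/sys"):
    """Return the sorted names of all devices sharing the group with `bdf`."""
    path = os.path.join(sysfs, "bus", "pci", "devices", bdf,
                        "iommu_group", "devices")
    return sorted(os.listdir(path))


def group_peers(bdf, sysfs="/sys"):
    """Devices a driver must also account for before using `bdf`."""
    return [dev for dev in group_members(bdf, sysfs) if dev != bdf]
```

A userspace driver that never calls something like group_peers() works
fine on systems where every device has its own group, which is exactly
how the step gets missed.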

> These two routes can have identical outcomes and identical security
> checks.
> 
> In both cases if userspace wants a list of BDFs in the same group as
> the BDF it is interested in:
>    readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")
> 
> It seems like a very small difference to me.

The difference is that the group becomes a nuance, that I expect would
be ignored, rather than a first class concept in the API.

> I still don't see how the group restriction gets surfaced to the
> application through the group FD. The applications I looked through
> just treat the group FD as a step on their way to get the device_fd.

A step where the developer hopefully recognizes that there might be
other devices in a group, a group can't be opened more than once, the
group has a flag indicating viability, they can't actually get the
device fd until the group is attached to an IOMMU context, all devices
in the group therefore have the same IOMMU context, and device fd isn't
actually available until they've gone through a proper setup, which is
an additional layer of protection.  A userspace vfio developer may not
understand why a group isn't viable, but they have that clue that it's
something at the group level to investigate.
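
As a rough illustration of those steps in the existing uapi, here is a
sketch in Python. The ioctl names, flag bits and struct layout are taken
from <linux/vfio.h>; the request numbers are recomputed by hand and assume
an architecture where _IOC_NONE is 0 (e.g. x86, arm64). The helper names
are made up, error handling is minimal, and the file descriptors are not
closed on failure.

```python
import fcntl
import os
import struct

VFIO_TYPE = ord(";")
VFIO_BASE = 100


def _IO(nr):
    """_IO(VFIO_TYPE, VFIO_BASE + nr), assuming _IOC_NONE == 0."""
    return (VFIO_TYPE << 8) | (VFIO_BASE + nr)


VFIO_SET_IOMMU = _IO(2)
VFIO_GROUP_GET_STATUS = _IO(3)
VFIO_GROUP_SET_CONTAINER = _IO(4)
VFIO_GROUP_GET_DEVICE_FD = _IO(6)

VFIO_GROUP_FLAGS_VIABLE = 1 << 0
VFIO_GROUP_FLAGS_CONTAINER_SET = 1 << 1
VFIO_TYPE1_IOMMU = 1


def group_is_viable(flags):
    """True if the status flags report the group as viable."""
    return bool(flags & VFIO_GROUP_FLAGS_VIABLE)


def open_device_via_group(group_name, bdf):
    """Walk container -> group -> device and return the device fd."""
    container = os.open("/dev/vfio/vfio", os.O_RDWR)
    group = os.open("/dev/vfio/" + group_name, os.O_RDWR)

    # struct vfio_group_status { __u32 argsz; __u32 flags; };
    status = bytearray(struct.pack("II", 8, 0))
    fcntl.ioctl(group, VFIO_GROUP_GET_STATUS, status)
    _, flags = struct.unpack("II", status)
    if not group_is_viable(flags):
        # The failure is reported at group level, before any device fd exists.
        raise RuntimeError("group %s is not viable" % group_name)

    # The whole group joins the container and gets one IOMMU context.
    fcntl.ioctl(group, VFIO_GROUP_SET_CONTAINER, struct.pack("i", container))
    fcntl.ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)

    # Passing a bytearray makes fcntl.ioctl return the ioctl's own return
    # value, which here is the new device fd.
    name = bytearray(bdf.encode() + b"\0")
    return fcntl.ioctl(group, VFIO_GROUP_GET_DEVICE_FD, name)
```

The device fd is the last thing obtained, so a developer following this
sequence necessarily touches the group and its viability flag first.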

Most of the userspace vfio drivers that I haven't contributed myself
have come about with little or no interaction from me, so I'd like to
think that the vfio uapi is actually somewhat intuitive in its concepts
and difficult to use incorrectly.  Thanks,

Alex


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-23 16:38                                                                           ` Alex Williamson
@ 2021-04-23 22:28                                                                             ` Jason Gunthorpe
  2021-04-27  5:15                                                                               ` David Gibson
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-23 22:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jacob Pan, Auger Eric, Jean-Philippe Brucker, Tian,
	Kevin, LKML, Joerg Roedel, Lu Baolu, David Woodhouse, iommu,
	cgroups, Tejun Heo, Li Zefan, Johannes Weiner,
	Jean-Philippe Brucker, Jonathan Corbet, Raj, Ashok, Wu, Hao,
	Jiang, Dave, David Gibson, Alexey Kardashevskiy

On Fri, Apr 23, 2021 at 10:38:51AM -0600, Alex Williamson wrote:
> On Thu, 22 Apr 2021 20:39:50 -0300

> > /dev/ioasid should understand the group concept somehow, otherwise it
> > is incomplete and maybe even security broken.
> > 
> > So, how do I add groups to, say, VDPA in a way that makes sense? The
> > only answer I come to is broadly what I outlined here - make
> > /dev/ioasid do all the group operations, and do them when we enjoin
> > the VDPA device to the ioasid.
> > 
> > Once I have solved all the groups problems with the non-VFIO users,
> > then where does that leave VFIO? Why does VFIO need a group FD if
> > everyone else doesn't?
> 
> This assumes there's a solution for vDPA that doesn't just ignore the
> problem and hope for the best.  I can't speak to a vDPA solution.

I don't think we can just ignore the question and succeed with
/dev/ioasid.

Guess it should get answered as best it can for ioasid "in general"
then we can decide if it makes sense for VFIO to use the group FD or
not when working in ioasid mode.

Maybe a better idea will come up

> an implicit restriction.  You've listed a step in the description about
> a "list of devices in the group", but nothing in the pseudo code
> reflects that step.

I gave it below with the readdir() - it isn't in the pseudo code
because the applications I looked through didn't use it, and wouldn't
benefit from it. I tried to show what things were doing today.

> I expect it would be subtly missed by any userspace driver
> developer unless they happen to work on a system where the grouping
> is not ideal.

I'm still unclear - what would be the consequence if the application
designer misses the group detail? 

Jason

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-23 11:49                                                                     ` Jason Gunthorpe
@ 2021-04-25  9:24                                                                       ` Tian, Kevin
  2021-04-26 12:38                                                                         ` Jason Gunthorpe
  2021-04-29  8:54                                                                       ` Auger Eric
  1 sibling, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-25  9:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 23, 2021 7:50 PM
> 
> On Fri, Apr 23, 2021 at 09:06:44AM +0000, Tian, Kevin wrote:
> 
> > Or could we still have just one /dev/ioasid but allow userspace to create
> > multiple gpa_ioasid_id's each associated to a different iommu domain?
> > Then the compatibility check will be done at ATTACH_IOASID instead of
> > JOIN_IOASID_FD.
> 
> To my mind what makes sense is that /dev/ioasid presents a single
> IOMMU behavior that is basically the same. This may ultimately not be
> what we call a domain today.
> 
> We may end up with a middle object which is a group of domains that
> all have the same capabilities, and we define capabilities in a way
> that most platforms have a single group of domains.
> 
> The key capability of a group of domains is they can all share the HW
> page table representation, so if an IOASID instantiates a page table
> it can be assigned to any device on any domain in the group of domains.

Sorry, I didn't quite get it. If a group of domains can share the
same page table then why not just attach all devices under those
domains to a single domain? IMO the iommu domain is introduced
to describe the HW page table. Ideally a new iommu domain should
be created only when it's impossible to share an existing page table. 
Otherwise you'll get bad iotlb efficiency because each domain has its
unique domain id (tagged in iotlb) then duplicated iotlb entries may
exist even when a single page table is shared by those domains.

Then does it imply that you are actually suggesting /dev/ioasid associated 
with a single 2nd-level page table (which can be nested by multiple 
1st-level page tables represented by other ioasids) thus a single iommu 
domain for all devices linked to compatible IOMMUs?

Or, can you elaborate on the targeted usage of having a group of
domains which all share the same page table?

> 
> If you try to say that /dev/ioasid has many domains and they can't
> have their HW page tables shared then I think the implementation
> complexity will explode.

I'd like to hear your opinion on one open question here. There is no doubt that
an ioasid represents a HW page table when the table is constructed by 
userspace and then linked to the IOMMU through the bind/unbind
API. But I'm not very sure about whether an ioasid should represent 
the exact pgtable or the mapping metadata when the underlying 
pgtable is indirectly constructed through map/unmap API. VFIO does
it the latter way, which is why it allows multiple incompatible domains
in a single container which all share the same mapping metadata.
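
To make the "mapping metadata" option concrete, here is a toy model in
Python. It is not the VFIO type1 code: Domain, Container and their methods
are invented for illustration, and a dict stands in for a HW page table.
The container keeps one list of mappings and replays it into every domain,
so the domains stay in sync without sharing a page table.

```python
class Domain:
    """One HW page table, modelled as an iova -> (phys, size) dict."""

    def __init__(self, name):
        self.name = name
        self.pgtable = {}


class Container:
    """Owns the mapping metadata and mirrors it into each attached domain."""

    def __init__(self):
        self.mappings = {}      # authoritative metadata
        self.domains = []

    def add_domain(self, domain):
        # Replay everything mapped so far into the new domain's own table.
        domain.pgtable.update(self.mappings)
        self.domains.append(domain)

    def map(self, iova, phys, size):
        self.mappings[iova] = (phys, size)
        for domain in self.domains:
            domain.pgtable[iova] = (phys, size)

    def unmap(self, iova):
        del self.mappings[iova]
        for domain in self.domains:
            del domain.pgtable[iova]
```

In this model an ioasid would name the Container (the metadata), whereas
in the bind/unbind case it would name a single Domain.pgtable.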

> 
> > This does impose one burden to userspace though, to understand the
> > IOMMU compatibilities and figure out which incompatible features may
> > affect the page table management (while such knowledge is IOMMU
> > vendor specific) and then explicitly manage multiple /dev/ioasid's or
> > multiple gpa_ioasid_id's.
> 
> Right, this seems very hard in the general case..
> 
> > Alternatively is it a good design by having the kernel return error at
> > attach/join time to indicate that incompatibility is detected then the
> > userspace should open a new /dev/ioasid or creates a new gpa_ioasid_id
> > for the failing device upon such failure, w/o constructing its own
> > compatibility knowledge?
> 
> Yes, this feels workable too
> 
> > > This means qemu might have multiple /dev/ioasid's if the system has
> > > multiple incompatible IOMMUs (is this actually a thing?) The platform
> >
> > One example is Intel platform with igd. Typically there is one IOMMU
> > dedicated for igd and the other IOMMU serving all the remaining devices.
> > The igd IOMMU may not support IOMMU_CACHE while the other one
> > does.
> 
> If we can do as above the two domains may be in the same group of
> domains and the IOMMU_CACHE is not exposed at the /dev/ioasid level.
> 
> For instance the API could specify IOMMU_CACHE during attach, not
> during IOASID creation.
> 
> Getting all the data model right in the API is going to be the trickiest
> part of this.
> 
> > yes, e.g. in vSVA both devices (behind divergence IOMMUs) are bound
> > to a single guest process which has an unique PASID and 1st-level page
> > table. Earlier incompatibility example is only for 2nd-level.
> 
> Because when we get to here, things become inscrutable as an API if
> you are trying to say two different IOMMU presentations can actually
> be nested.
> 
> > > Sure.. The tricky bit will be to define both of the common nested
> > > operating modes.
> > >
> > >   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id);
> > >   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
> > >
> > >    // IOMMU will match on the device RID, no PASID:
> > >   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> > >
> > >    // IOMMU will match on the device RID and PASID:
> > >   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> >
> > I'm a bit confused here why we have both pasid and ioasid notations together.
> > Why not use nested_ioasid as pasid directly (i.e. every pasid in nested mode
> > is created by CREATE_NESTED_IOASID)?
> 
> The IOASID is not a PASID, it is just a page table.
> 
> A generic IOMMU matches on either RID or (RID,PASID), so you should
> specify the PASID when establishing the match.
> 
> IOASID only specifies the page table.
> 
> So you read the above as configuring the path
> 
>   PCI_DEVICE -> (RID,PASID) -> nested_ioasid -> gpa_ioasid_id -> physical
> 
> Where (RID,PASID) indicate values taken from the PCI packet.
> 
> In principle the IOMMU could also be commanded to reuse the same
> ioasid page table with a different PASID:
> 
>   PCI_DEVICE_B -> (RID_B,PASID_B) -> nested_ioasid -> gpa_ioasid_id -> physical
> 
> This is impossible if the ioasid == PASID in the API.

OK, now I see where the disconnect comes from. In my context ioasid
is the identifier that is actually used on the wire, but it seems you treat it as
a sw-defined namespace purely for representing page tables. We should
clarify this concept first before further discussing other details. 😊

Below is the description when the kernel ioasid allocator was introduced:

--
commit fa83433c92e340822a056a610a4fa2063a3db304
Author: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Date:   Wed Oct 2 12:42:41 2019 -0700

    iommu: Add I/O ASID allocator

    Some devices might support multiple DMA address spaces, in particular
    those that have the PCI PASID feature. PASID (Process Address Space ID)
    allows to share process address spaces with devices (SVA), partition a
    device into VM-assignable entities (VFIO mdev) or simply provide
    multiple DMA address space to kernel drivers. Add a global PASID
    allocator usable by different drivers at the same time. Name it I/O ASID
    to avoid confusion with ASIDs allocated by arch code, which are usually
    a separate ID space.

    The IOASID space is global. Each device can have its own PASID space,
    but by convention the IOMMU ended up having a global PASID space, so
    that with SVA, each mm_struct is associated to a single PASID.

    The allocator is primarily used by IOMMU subsystem but in rare occasions
    drivers would like to allocate PASIDs for devices that aren't managed by
    an IOMMU, using the same ID space as IOMMU.
--

ioasid and pasid are used interchangeably within the kernel, and the ioasid
value returned by the ioasid allocator is directly used as PASID when the
driver programs IOMMU and device, respectively. My context is based on 
this understanding, which is why I thought nested_ioasid can be directly 
used as PASID in earlier reply. Do you see a problem with this approach?

Then following your proposal, does it mean that we need another interface
for allocating PASIDs? And since ioasid means different things in uAPI and
in-kernel API, possibly a new name is required to avoid confusion?

> 
> > Below I list different scenarios for ATTACH_IOASID in my view. Here
> > vfio_device could be a real PCI function (RID), or a subfunction device
> > (RID+def_ioasid).
> 
> What is RID+def_ioasid? The IOMMU does not match on IOASID's.
> 
> A subfunction device always needs to use a PASID, or an internal IOMMU.
> I'm confused about what you are trying to explain.

Here the def_ioasid is the PASID that is associated with the subfunction
in my context with above explanation.

> 
> > If the whole PASID table is delegated to the guest in ARM case, the guest
> > can select its own PASIDs w/o telling the hypervisor.
> 
> The hypervisor has to route the PASID's to the guest at some point - a
> guest can't just claim a PASID unilaterally, that would not be secure.
> 
> If it is not done with per-PASID hypercalls then the hypervisor has to
> route all PASID's for a RID to the guest and /dev/ioasid needs to have
> a nested IOASID object that represents this connection - ie it points
> to the PASID table of the guest vIOMMU or something.

Yes, this might be the model that will work for the ARM case. In their
architecture the PASID table is located in the GPA space and is thus naturally
managed by the guest (though Jean once mentioned a tricky method
to allow the host to manage it by stealing a GPA window).

> 
> Remember this all has to be compatible with mdev's too and without
> hypercalls to create PASIDs that will be hard: mdev sharing a RID and
> slicing the physical PASIDs can't support a 'send all PASIDs to the
> guest' model, or even a 'the guest gets to pick the PASID' option.
> 

Yes, with mdev the above is inevitable. I guess ARM may need an
extension in their SMMU similar to what VT-d provides for mdev
usage, e.g. at least not mandating the PASID table in GPA space. But
before that they may still expect a way to delegate the whole
PASID space per RID to the guest.

Really lots of subtle differences to be generalized...

Thanks
Kevin

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-25  9:24                                                                       ` Tian, Kevin
@ 2021-04-26 12:38                                                                         ` Jason Gunthorpe
  2021-04-28  6:34                                                                           ` Tian, Kevin
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-26 12:38 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave

On Sun, Apr 25, 2021 at 09:24:46AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, April 23, 2021 7:50 PM
> > 
> > On Fri, Apr 23, 2021 at 09:06:44AM +0000, Tian, Kevin wrote:
> > 
> > > Or could we still have just one /dev/ioasid but allow userspace to create
> > > multiple gpa_ioasid_id's each associated to a different iommu domain?
> > > Then the compatibility check will be done at ATTACH_IOASID instead of
> > > JOIN_IOASID_FD.
> > 
> > To my mind what makes sense is that /dev/ioasid presents a single
> > IOMMU behavior that is basically the same. This may ultimately not be
> > what we call a domain today.
> > 
> > We may end up with a middle object which is a group of domains that
> > all have the same capabilities, and we define capabilities in a way
> > that most platforms have a single group of domains.
> > 
> > The key capability of a group of domains is they can all share the HW
> > page table representation, so if an IOASID instantiates a page table
> > it can be assigned to any device on any domain in the group of domains.
> 
> Sorry that I didn't quite get it. If a group of domains can share the 
> same page table then why not just attaching all devices under those
> domains into a single domain?

Sure, if that works. But you shouldn't have things like IOMMU_CACHE
create different domains or trigger different /dev/ioasid's.

> to describe the HW page table. Ideally a new iommu domain should
> be created only when it's impossible to share an existing page table. 
> Otherwise you'll get bad iotlb efficiency because each domain has its
> unique domain id (tagged in iotlb) then duplicated iotlb entries may
> exist even when a single page table is shared by those domains.

Right, fewer is better

> Or, can you elaborate what is the targeted usage by having a group of
> domains which all share the same page table?

You just need to have a clear rule for what requires a new /dev/ioasid
FD - and if it maps to domains then great.

> Want to hear your opinion for one open here. There is no doubt that
> an ioasid represents a HW page table when the table is constructed by 
> userspace and then linked to the IOMMU through the bind/unbind
> API. But I'm not very sure about whether an ioasid should represent 
> the exact pgtable or the mapping metadata when the underlying 
> pgtable is indirectly constructed through map/unmap API. VFIO does
> the latter way, which is why it allows multiple incompatible domains
> in a single container which all share the same mapping metadata.

I think VFIO's map/unmap is way too complex and we know it has bad
performance problems. 

If /dev/ioasid is a single HW page table only then I would focus on that
implementation and leave it to userspace to span different
/dev/ioasids if needed.

> OK, now I see where the disconnection comes from. In my context ioasid
> is the identifier that is actually used in the wire, but seems you treat it as 
> a sw-defined namespace purely for representing page tables. We should 
> clear this concept first before further discussing other details. 😊

There is no general HW requirement that every IO page table be
referred to by the same PASID and this API would have to support
non-PASID IO page tables as well. So I'd keep the two things
separated in the uAPI - even though the kernel today has a global
PASID pool.

> Then following your proposal, does it mean that we need another
> interface for allocating PASID? and since ioasid means different
> thing in uAPI and in-kernel API, possibly a new name is required to
> avoid confusion?

I would suggest having two ways to control the PASID:

 1) Over /dev/ioasid allocate a PASID for an IOASID. All future PASID
    based usages of the IOASID will use that global PASID

 2) Over the device FD, when the IOASID is bound return the PASID that
    was selected. If the IOASID does not have a global PASID then the
    kernel is free to make something up. In this mode a single IOASID
    can have multiple PASIDs.

Simple things like DPDK can use #2 and potentially have better PASID
limits. Hypervisors will most likely have to use #1, but it depends on
how their vIOMMU interface works.
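
The selection logic of the two modes above can be modeled in a few
lines of C. This is only a sketch, not kernel code: struct
ioasid_model, struct device_model, ioasid_set_global_pasid() and
ioasid_bind_pasid() are hypothetical names, and the value 0 is assumed
to mean "no global PASID has been allocated".

```c
#include <assert.h>
#include <stdint.h>

#define NO_PASID 0u  /* assumption: 0 means "no global PASID allocated" */

/* Hypothetical model of one IOASID; not a kernel structure. */
struct ioasid_model {
	uint32_t global_pasid;      /* mode 1: set once over /dev/ioasid */
};

/* Hypothetical per-device state: the next PASID this device may pick. */
struct device_model {
	uint32_t next_local_pasid;
};

/* Mode 1: pin a global PASID to the IOASID. */
static void ioasid_set_global_pasid(struct ioasid_model *ioasid,
				    uint32_t pasid)
{
	assert(pasid != NO_PASID);
	ioasid->global_pasid = pasid;
}

/*
 * Bind over the device FD and return the PASID that was selected.
 * With a global PASID every device sees the same value; without one
 * (mode 2) each device makes something up, so a single IOASID can end
 * up with several PASIDs.
 */
static uint32_t ioasid_bind_pasid(const struct ioasid_model *ioasid,
				  struct device_model *dev)
{
	if (ioasid->global_pasid != NO_PASID)
		return ioasid->global_pasid;
	return dev->next_local_pasid++;
}
```

In this model two devices bound to the same IOASID get different
PASIDs until ioasid_set_global_pasid() is called, after which both
binds return the one global value.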

I think the name IOASID is fine for the uAPI, the kernel version can
be called ioasid_id or something.

(also looking at ioasid.c, why do we need such a thin and odd wrapper
around xarray?)

Jason


* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 17:13                                                               ` Alex Williamson
  2021-04-22 17:57                                                                 ` Jason Gunthorpe
@ 2021-04-27  4:50                                                                 ` David Gibson
  2021-04-27 17:24                                                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: David Gibson @ 2021-04-27  4:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy


On Thu, Apr 22, 2021 at 11:13:37AM -0600, Alex Williamson wrote:
> On Wed, 21 Apr 2021 20:03:01 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 21, 2021 at 01:33:12PM -0600, Alex Williamson wrote:
> > 
> > > > I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> > > > /dev/{ioasid,vfio} to the VFIO group and all the group and device
> > > > logic stays inside VFIO.  
> > > 
> > > But that group and device logic is also tied to the container, where
> > > the IOMMU backend is the interchangeable thing that provides the IOMMU
> > > manipulation for that container.  
> > 
> > I think that is an area where the discussion would need to be focused.
> > 
> > I don't feel very prepared to have it in details, as I haven't dug
> > into all the group and iommu micro-operation very much.
> > 
> > But, it does seem like the security concept that VFIO is creating with
> > the group also has to be present in the lower iommu layer too.
> > 
> > With different subsystems joining devices to the same ioasid's we
> > still have to enforce the security property the vfio group is creating.
> > 
> > > If you're using VFIO_GROUP_SET_CONTAINER to associate a group to a
> > > /dev/ioasid, then you're really either taking that group outside of
> > > vfio or you're re-implementing group management in /dev/ioasid.   
> > 
> > This sounds right.
> > 
> > > > Everything can be switched to ioasid_container all down the line. If
> > > > it wasn't for PPC this looks fairly simple.  
> > > 
> > > At what point is it no longer vfio?  I'd venture to say that replacing
> > > the container rather than invoking a different IOMMU backend is that
> > > point.  
> > 
> > sorry, which is no longer vfio?
> 
> I'm suggesting that if we're replacing the container/group model with
> an ioasid then we're effectively creating a new thing that really only
> retains the vfio device uapi.
> 
> > > > Since getting rid of PPC looks a bit hard, we'd be stuck with
> > > > accepting a /dev/ioasid and then immediately wrapping it in a
> > > > vfio_container and shimming it through a vfio_iommu_ops. It is not
> > > > ideal at all, but in my look around I don't see a major problem if
> > > > type1 implementation is moved to live under /dev/ioasid.  
> > > 
> > > But type1 is \just\ an IOMMU backend, not "/dev/vfio".  Given that
> > > nobody flinched at removing NVLink support, maybe just deprecate SPAPR
> > > now and see if anyone objects ;)  
> > 
> > Would simplify this project, but I wonder :)
> > 
> > In any event, it does look like today we'd expect the SPAPR stuff
> > would be done through the normal iommu APIs, perhaps enhanced a bit,
> > which makes me suspect an enhanced type1 can implement SPAPR.
> 
> David Gibson has argued for some time that SPAPR could be handled via a
> converged type1 model.  We had mapped that out at one point,
> essentially a "type2", but neither of us had any bandwidth to pursue it.

Right.  The sPAPR TCE backend is kind of an unfortunate accident of
history.  We absolutely could do a common interface, but no-one's had
time to work on it.

> > I say this because the SPAPR looks quite a lot like PASID when it has
> > APIs for allocating multiple tables and other things. I would be
> > interested to hear someone from IBM talk about what it is doing and
> > how it doesn't fit into today's IOMMU API.

Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
backend represents a single DMA address space which all devices in the
container will see at all times.  The difference is that there can be
multiple (well, 2) "windows" of valid IOVAs within that address space.
Each window can have a different TCE (page table) layout.  For kernel
drivers, a smallish translated window at IOVA 0 is used for 32-bit
devices, and a large direct mapped (no page table) window is created
at a high IOVA for better performance with 64-bit DMA capable devices.

With the VFIO backend we create (but don't populate) a similar
smallish 32-bit window; userspace can create its own secondary window
if it likes, though obviously for userspace use there will always be a
page table.  Userspace can choose the total size (but not address),
page size and to an extent the page table format of the created
window.  Note that the TCE page table format is *not* the same as the
POWER CPU core's page table format.  Userspace can also remove the
default small window and create its own.
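
As a rough illustration of the "one address space, multiple windows"
idea, the check for whether an IOVA range is valid can be sketched as
below. This is a simplified model, not the kernel's representation:
struct dma_window, window_contains() and find_window() are
hypothetical names, and any window addresses or sizes fed to them are
the caller's assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one DMA window within the address space. */
struct dma_window {
	uint64_t start;        /* first valid IOVA */
	uint64_t size;         /* window length in bytes */
	unsigned page_shift;   /* log2 of this window's IOMMU page size */
};

/* True if [iova, iova + len) lies entirely inside the window. */
static bool window_contains(const struct dma_window *w,
			    uint64_t iova, uint64_t len)
{
	uint64_t off;

	assert(len > 0);
	if (iova < w->start)
		return false;
	off = iova - w->start;
	/* Written this way so iova + len cannot overflow. */
	return off < w->size && len <= w->size - off;
}

/* Index of the window covering the range, or -1 if none does. */
static int find_window(const struct dma_window *wins, int nwins,
		       uint64_t iova, uint64_t len)
{
	for (int i = 0; i < nwins; i++)
		if (window_contains(&wins[i], iova, len))
			return i;
	return -1;
}
```

An IOVA that falls in the gap between the small low window and a
large high window is simply invalid, which is the main visible
difference from a single contiguous Type1-style range.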

The second wrinkle is pre-registration.  That lets userspace register
certain userspace VA ranges (*not* IOVA ranges) as being the only ones
allowed to be mapped into the IOMMU.  This is a performance
optimization, because on pre-registration we also pre-account memory
that will be effectively locked by DMA mappings, rather than doing it
at DMA map and unmap time.

This came about because POWER guests always contain a vIOMMU.  That
(combined with the smallish default IOVA window) means that DMA maps
and unmaps can become an important bottleneck, rather than being
basically a small once-off cost when qemu maps all of guest memory
into the IOMMU.  That's optimized with a special interlink between
KVM and VFIO that accelerates the guest-initiated map/unmap
operations.  However, it's not feasible to do the accounting in that
fast path, hence the need for the pre-registration.
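
The split between the two paths can be modeled roughly as follows.
This is only a sketch under stated assumptions, not the sPAPR TCE
code: struct prereg_ctx, prereg_register() and prereg_map_allowed()
are hypothetical names, accounting uses an assumed 4 KiB granule, and
overlapping registrations are not merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8
#define ACCT_PAGE_SHIFT 12   /* assumption: 4 KiB accounting granule */

struct prereg_region {
	uint64_t va;             /* userspace VA, *not* an IOVA */
	uint64_t len;
};

struct prereg_ctx {
	struct prereg_region regions[MAX_REGIONS];
	size_t nregions;
	uint64_t locked_pages;   /* accounted once, at registration time */
};

/* Slow path: register a VA range and account its pages as locked. */
static bool prereg_register(struct prereg_ctx *ctx, uint64_t va,
			    uint64_t len)
{
	if (ctx->nregions == MAX_REGIONS || len == 0)
		return false;
	ctx->regions[ctx->nregions].va = va;
	ctx->regions[ctx->nregions].len = len;
	ctx->nregions++;
	ctx->locked_pages +=
		(len + (1ULL << ACCT_PAGE_SHIFT) - 1) >> ACCT_PAGE_SHIFT;
	return true;
}

/* Fast path: only a range check against registered memory, no accounting. */
static bool prereg_map_allowed(const struct prereg_ctx *ctx, uint64_t va,
			       uint64_t len)
{
	assert(len > 0);
	for (size_t i = 0; i < ctx->nregions; i++) {
		const struct prereg_region *r = &ctx->regions[i];

		if (va >= r->va && len <= r->len &&
		    va - r->va <= r->len - len)
			return true;
	}
	return false;
}
```

The point of the model is that locked_pages changes only in
prereg_register(); the map-time check touches no accounting state, so
it is cheap enough for a guest-initiated fast path.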

> 
> [Cc David, Alexey]
> 
> > It is very old and the iommu world has advanced tremendously lately,
> > maybe I'm too optimistic?
> > 
> > > > We end up with a ioasid.h that basically has the vfio_iommu_type1 code
> > > > lightly recast into some 'struct iommu_container' and a set of
> > > > ioasid_* function entry points that follow vfio_iommu_driver_ops_type1:
> > > >   ioasid_attach_group
> > > >   ioasid_detatch_group
> > > >   ioasid_<something about user pages>
> > > >   ioasid_read/ioasid_write  
> > > 
> > > Again, this looks like a vfio IOMMU backend.  What are we accomplishing
> > > by replacing /dev/vfio with /dev/ioasid versus some manipulation of
> > > VFIO_SET_IOMMU accepting a /dev/ioasid fd?  
> > 
> > The point of all of this is to make the user api for the IOMMU
> > cross-subsystem. It is not a vfio IOMMU backend, it is moving the
> > IOMMU abstraction from VFIO into the iommu framework and giving the
> > iommu framework a re-usable user API.

I like the idea of a common DMA/IOMMU handling system across
platforms.  However in order to be efficiently usable for POWER it
will need to include multiple windows, allowing the user to change
those windows and something like pre-registration to amortize
accounting costs for heavy vIOMMU load.

Well... possibly we can do without the pre-reg now that 32-bit DMA
limited devices are less common, as are POWER8 systems.  With modern
devices and modern kernels a guest is likely to use a single large
64-bit secondary window mapping all guest RAM, so the vIOMMU
bottleneck shouldn't be such an issue.

> Right, but I don't see that implies it cannot work within the vfio
> IOMMU model.  Currently when an IOMMU is set, the /dev/vfio/vfio
> container becomes a conduit for file ops from the container to be
> forwarded to the IOMMU.  But that's in part because the user doesn't
> have another object to interact with the IOMMU.  It's entirely possible
> that with an ioasid shim, the user would continue to interact directly
> with the /dev/ioasid fd for IOMMU manipulation and only use
> VFIO_SET_IOMMU to associate a vfio container to that ioasid.
> 
> > My ideal outcome would be for VFIO to use only the new iommu/ioasid
> > API and have no iommu pluggability at all. The iommu subsystem
> > provides everything needed to VFIO, and provides it equally to VDPA
> > and everything else.
> 
> As above, we don't necessarily need to have the vfio container be the
> access mechanism for the IOMMU, it can become just a means to
> associate the container with an IOMMU.  This has quite a few
> transitional benefits.
> 
> > drivers/vfio/ becomes primarily about 'struct vfio_device' and
> > everything related to its IOCTL interface.
> > 
> > drivers/iommu and ioasid.c become all about a pluggable IOMMU
> > interface, including a uAPI for it.
> > 
> > IMHO it makes a high level sense, though it may be a pipe dream.
> 
> This is where we've dissolved all but the vfio device uapi, which
> suggests the group and container model were never necessary and I'm not
> sure exactly what that uapi looks like.  We currently make use of an
> IOMMU api that is group aware, but that awareness extends out to the
> vfio uapi.
> 
> > > > If we have this, and /dev/ioasid implements the legacy IOCTLs, then
> > > > /dev/vfio == /dev/ioasid and we can compile out vfio_fops and related
> > > > from vfio.c and tell ioasid.c to create /dev/vfio instead using the
> > > > ops it owns.  
> > > 
> > > Why would we want /dev/ioasid to implement legacy ioctls instead of
> > > simply implementing an interface to allow /dev/ioasid to be used as a
> > > vfio IOMMU backend?  
> > 
> > Only to make our own migration easier. I'd imagine everyone would want
> > to sit down and design this new clear ioasid API that can co-exist on
> > /dev/ioasid with the legacy ones.
> 
> vfio really just wants to be able to attach groups to an address space
> to consider them isolated, everything else about the IOMMU API could
> happen via a new ioasid file descriptor representing that context, ie.
> vfio handles the group ownership and device access, ioasid handles the
> actual mappings.
> 
> > > The pseudo code above really suggests you do want to remove
> > > /dev/vfio/vfio, but this is only one of the IOMMU backends for vfio, so
> > > I can't quite figure out if we're talking past each other.  
> > 
> > I'm not quite sure what you mean by "one of the IOMMU backends?" You
> > mean type1, right?
> >  
> > > As I expressed in another thread, type1 has a lot of shortcomings.  The
> > > mapping interface leaves userspace trying desperately to use statically
> > > mapped buffers because the map/unmap latency is too high.  We have
> > > horrible issues with duplicate locked page accounting across
> > > containers.  It suffers pretty hard from feature creep in various
> > > areas.  A new IOMMU backend is an opportunity to redesign some of these
> > > things.  
> > 
> > Sure, but also those kinds of transformational things go a lot better
> > if you can smoothly go from the old to the new and have technical
> > co-existence inside the kernel. Having a shim that maps the old APIs
> > to new APIs internally to Linux helps keep the implementation from
> > becoming too bogged down with compatibility.
> 
> I'm afraid /dev/ioasid providing type1 compatibility would be just that.
> 
> > > The IOMMU group also abstracts isolation and visibility relative to
> > > DMA.  For example, in a PCIe topology a multi-function device may not
> > > have isolation between functions, but each requester ID is visible to
> > > the IOMMU.    
> > 
> > Okay, I'm glad I have this all right in my head, as I was pretty sure
> > this was what the group was about.
> > 
> > My next question is why do we have three things as a FD: group, device
> > and container (aka IOMMU interface)?
> > 
> > Do we have container because the /dev/vfio/vfio can hold only a single
> > page table so we need to swap containers sometimes?
> 
> The container represents an IOMMU address space, which can be shared by
> multiple groups, where each group may contain one or more devices.
> Swapping a container would require releasing all the devices (the user
> cannot have access to a non-isolated device), then a group could be
> moved from one container to another.
> 
> > If we start from a clean sheet and make a sketch..
> > 
> > /dev/ioasid is the IOMMU control interface. It can create multiple
> > IOASIDs that have page tables and it can manipulate those page tables.
> > Each IOASID is identified by some number.
> > 
> > struct vfio_device/vdpa_device/etc are consumers of /dev/ioasid
> > 
> > When a device attaches to an ioasid userspace gives VFIO/VDPA the
> > ioasid FD and the ioasid # in the FD.
> > 
> > The security rule for isolation is that once a device is attached to a
> > /dev/ioasid fd then all other devices in that security group must be
> > attached to the same ioasid FD or left unused.
> 
> Sounds like a group...  Note also that if those other devices are not
> isolated from the user's device, the user could manipulate "unused"
> devices via DMA.  So even unused devices should be within the same
> IOMMU context... thus attaching groups to IOMMU domains.
> 
> > Thus /dev/ioasid also becomes the unit of security and the IOMMU
> > subsystem level becomes aware of and enforces the group security
> > rules. Userspace does not need to "see" the group
> 
> What tools does userspace have to understand isolation of individual
> devices without groups?
>  
> > In sketch it would be like
> >   ioasid_fd = open("/dev/ioasid");
> >   vfio_device_fd = open("/dev/vfio/device0")
> >   vdpa_device_fd = open("/dev/vdpa/device0")
> >   ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd)
> >   ioctl(vdpa_device_fd, JOIN_IOASID_FD, ioasid_fd)
> > 
> >   gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
> >   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> > 
> >   ioctl(vfio_device_fd, ATTACH_IOASID, gpa_ioasid_id)
> >   ioctl(vdpa_device_fd, ATTACH_IOASID, gpa_ioasid_id)
> > 
> >   .. both VDPA and VFIO see the guest physical map and the kernel has
> >      enough info that both could use the same IOMMU page table
> >      structure ..
> > 
> >   // Guest viommu turns off bypass mode for the vfio device
> >   ioctl(vfio_device_fd, DETATCH_IOASID)
> >  
> >   // Guest viommu creates a new page table
> >   rid_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
> >   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> > 
> >   // Guest viommu links the new page table to the RID
> >   ioctl(vfio_device_fd, ATTACH_IOASID, rid_ioasid_id)
> > 
> > The group security concept becomes implicit and hidden from the
> > uAPI. JOIN_IOASID_FD implicitly finds the device's group inside the
> > kernel and requires that all members of the group be joined only to
> > this ioasid_fd.
> > 
> > Essentially we discover the group from the device instead of the
> > device from the group.
> > 
> > Where does it fall down compared to the three FD version we have
> > today?
> 
> The group concept is explicit today because how does userspace learn
> about implicit dependencies between devices?  For example, if the user
> has a conventional PCI bus with a couple devices on it, how do they
> understand that those devices cannot be assigned to separate userspace
> drivers?  The group fd fills that gap.  Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-22 23:39                                                                         ` Jason Gunthorpe
  2021-04-23 10:31                                                                           ` Tian, Kevin
  2021-04-23 16:38                                                                           ` Alex Williamson
@ 2021-04-27  5:08                                                                           ` David Gibson
  2021-04-27 17:12                                                                             ` Jason Gunthorpe
  2 siblings, 1 reply; 269+ messages in thread
From: David Gibson @ 2021-04-27  5:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy


On Thu, Apr 22, 2021 at 08:39:50PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
> 
> > Because it's fundamental to the isolation of the device?  What you're
> > proposing doesn't get around the group issue, it just makes it implicit
> > rather than explicit in the uapi.
> 
> I'm not even sure it makes it explicit or implicit, it just takes away
> the FD.
> 
> There are four group IOCTLs, I see them mapping to /dev/ioasid as follows:
>  VFIO_GROUP_GET_STATUS -
>    + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
>    + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
>      kernel/iommu_groups, or could be an IOCTL on /dev/ioasid
>        IOASID_ALL_DEVICES_VIABLE
> 
>  VFIO_GROUP_SET_CONTAINER -
>    + This happens implicitly when the device joins the IOASID
>      so it gets moved to the vfio_device FD:
>       ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd)
> 
>  VFIO_GROUP_UNSET_CONTAINER -
>    + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
> 
>  VFIO_GROUP_GET_DEVICE_FD -
>    + Replaced by opening /dev/vfio/deviceX
>      Learn the deviceX which will be the cdev sysfs shows as:
>       /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
>     Open /dev/vfio/deviceX
> 
> > > How do we model the VFIO group security concept to something like
> > > VDPA?
> > 
> > Is it really a "VFIO group security concept"?  We're reflecting the
> > reality of the hardware, not all devices are fully isolated.  
> 
> Well, exactly.
> 
> /dev/ioasid should understand the group concept somehow, otherwise it
> is incomplete and maybe even security broken.
> 
> So, how do I add groups to, say, VDPA in a way that makes sense? The
> only answer I come to is broadly what I outlined here - make
> /dev/ioasid do all the group operations, and do them when we enjoin
> the VDPA device to the ioasid.
> 
> Once I have solved all the groups problems with the non-VFIO users,
> then where does that leave VFIO? Why does VFIO need a group FD if
> everyone else doesn't?
> 
> > IOMMU group.  This is the reality that any userspace driver needs to
> > play in, it doesn't magically go away because we drop the group file
> > descriptor.  
> 
> I'm not saying it does, I'm saying it makes the uAPI more regular and
> easier to fit into /dev/ioasid without the group FD.
> 
> > It only makes the uapi more difficult to use correctly because
> > userspace drivers need to go outside of the uapi to have any idea
> > that this restriction exists.  
> 
> I don't think it makes any substantive difference one way or the
> other.
> 
> With the group FD: the userspace has to read sysfs, find the list of
> devices in the group, open the group fd, create device FDs for each
> device using the name from sysfs.
> 
> Starting from a BDF the general pseudo code is
>  group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
>  group_name = basename(group_path)
>  group_fd = open("/dev/vfio/"+group_name)
>  device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, BDF);
> 
> Without the group FD: the userspace has to read sysfs, find the list
> of devices in the group and then open the device-specific cdev (found
> via sysfs) and link them to a /dev/ioasid FD.
> 
> Starting from a BDF the general pseudo code is:
>  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
>  device_fd = open("/dev/vfio/"+device_name)
>  ioasidfd = open("/dev/ioasid")
>  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)

This line is the problem.

[Historical aside: Alex's early drafts for the VFIO interface looked
quite similar to this.  Ben Herrenschmidt and myself persuaded him it
was a bad idea, and groups were developed instead.  I still think it's
a bad idea, and not just for POWER]

As Alex says, if this line fails because of the group restrictions,
that's not great because it's not very obvious what's gone wrong.  But
IMO, the success path on a multi-device group is kind of worse:
you've now made a meaningful and visible change to the setup of
devices which are not mentioned in this line *at all*.  If you've
changed the DMA address space of this device you've also changed it
for everything else in the group - there's no getting around that.

For both those reasons, I absolutely agree with Alex that retaining
the explicit group model is valuable.

Yes, it makes set up more of a pain, but it's necessary complexity to
actually understand what's going on here.
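
To make that group-wide effect visible before touching anything, a
tool can enumerate the group's members from sysfs using the
iommu_group/devices directory of the device in question. A minimal
sketch, assuming a Linux sysfs layout; list_group_devices() is a
hypothetical helper name and the caller supplies the directory path:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Print and count the entries of an iommu_group "devices" directory.
 * Returns the number of devices found, or -1 if the directory cannot
 * be opened (for example, the device has no IOMMU group).
 */
static int list_group_devices(const char *devices_dir)
{
	DIR *dir;
	struct dirent *ent;
	int count = 0;

	assert(devices_dir != NULL);
	dir = opendir(devices_dir);
	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		/* Skip the "." and ".." pseudo-entries. */
		if (strcmp(ent->d_name, ".") == 0 ||
		    strcmp(ent->d_name, "..") == 0)
			continue;
		printf("group member: %s\n", ent->d_name);
		count++;
	}
	closedir(dir);
	return count;
}
```

Called with "/sys/bus/pci/devices/BDF/iommu_group/devices" (BDF
replaced by the device's address), a result greater than one means
that changing the DMA address space of one device also changes it for
every other device listed.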


> These two routes can have identical outcomes and identical security
> checks.
> 
> In both cases if userspace wants a list of BDFs in the same group as
> the BDF it is interested in:
>    readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")
> 
> It seems like a very small difference to me.
> 
> I still don't see how the group restriction gets surfaced to the
> application through the group FD. The applications I looked through
> just treat the group FD as a step on their way to get the device_fd.
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-23 10:31                                                                           ` Tian, Kevin
  2021-04-23 11:57                                                                             ` Jason Gunthorpe
@ 2021-04-27  5:11                                                                             ` David Gibson
  2021-04-27 16:39                                                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 269+ messages in thread
From: David Gibson @ 2021-04-27  5:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Liu, Yi L, Jacob Pan,
	Auger Eric, Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy


On Fri, Apr 23, 2021 at 10:31:46AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, April 23, 2021 7:40 AM
> > 
> > On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
> > 
> > > Because it's fundamental to the isolation of the device?  What you're
> > > proposing doesn't get around the group issue, it just makes it implicit
> > > rather than explicit in the uapi.
> > 
> > I'm not even sure it makes it explicit or implicit, it just takes away
> > the FD.
> > 
> > > There are four group IOCTLs, I see them mapping to /dev/ioasid as follows:
> > >  VFIO_GROUP_GET_STATUS -
> > >    + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
> > >    + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
> > >      kernel/iommu_groups, or could be an IOCTL on /dev/ioasid
> > >        IOASID_ALL_DEVICES_VIABLE
> > > 
> > >  VFIO_GROUP_SET_CONTAINER -
> > >    + This happens implicitly when the device joins the IOASID
> > >      so it gets moved to the vfio_device FD:
> > >       ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd)
> > 
> >  VFIO_GROUP_UNSET_CONTAINER -
> >    + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
> > 
> >  VFIO_GROUP_GET_DEVICE_FD -
> >    + Replaced by opening /dev/vfio/deviceX
> >      Learn the deviceX which will be the cdev sysfs shows as:
> >       /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
> >     Open /dev/vfio/deviceX
> > 
> > > > How do we model the VFIO group security concept to something like
> > > > VDPA?
> > >
> > > Is it really a "VFIO group security concept"?  We're reflecting the
> > > reality of the hardware, not all devices are fully isolated.
> > 
> > Well, exactly.
> > 
> > /dev/ioasid should understand the group concept somehow, otherwise it
> > is incomplete and maybe even security broken.
> > 
> > So, how do I add groups to, say, VDPA in a way that makes sense? The
> > only answer I come to is broadly what I outlined here - make
> > /dev/ioasid do all the group operations, and do them when we enjoin
> > the VDPA device to the ioasid.
> > 
> > Once I have solved all the groups problems with the non-VFIO users,
> > then where does that leave VFIO? Why does VFIO need a group FD if
> > everyone else doesn't?
> > 
> > > IOMMU group.  This is the reality that any userspace driver needs to
> > > play in, it doesn't magically go away because we drop the group file
> > > descriptor.
> > 
> > I'm not saying it does, I'm saying it makes the uAPI more regular and
> > easier to fit into /dev/ioasid without the group FD.
> > 
> > > It only makes the uapi more difficult to use correctly because
> > > userspace drivers need to go outside of the uapi to have any idea
> > > that this restriction exists.
> > 
> > I don't think it makes any substantive difference one way or the
> > other.
> > 
> > With the group FD: the userspace has to read sysfs, find the list of
> > devices in the group, open the group fd, create device FDs for each
> > device using the name from sysfs.
> > 
> > > Starting from a BDF the general pseudo code is
> > >  group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
> > >  group_name = basename(group_path)
> > >  group_fd = open("/dev/vfio/"+group_name)
> > >  device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, BDF);
> > 
> > Without the group FD: the userspace has to read sysfs, find the list
> > of devices in the group and then open the device-specific cdev (found
> > via sysfs) and link them to a /dev/ioasid FD.
> > 
> > Starting from a BDF the general pseudo code is:
> >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> >  device_fd = open("/dev/vfio/"+device_name)
> >  ioasidfd = open("/dev/ioasid")
> >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > 
> > These two routes can have identical outcomes and identical security
> > checks.
> > 
> > In both cases if userspace wants a list of BDFs in the same group as
> > the BDF it is interested in:
> >    readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")
> > 
> > It seems like a very small difference to me.
> > 
> > I still don't see how the group restriction gets surfaced to the
> > application through the group FD. The applications I looked through
> > just treat the group FD as a step on their way to get the device_fd.
> > 
> 
> So your proposal sort of moves the entire container/group/domain
> management into /dev/ioasid and then leaves vfio to provide only the
> device-specific uAPI. An ioasid represents a page table (address space),
> and thus is equivalent to the scope of a VFIO container.

Right.  I don't really know how /dev/ioasid is supposed to work, and
so far I don't see how it conceptually differs from a container.  What
is it adding?

> Having the device join 
> an ioasid is equivalent to attaching a device to VFIO container, and 
> here the group integrity must be enforced. Then /dev/ioasid anyway 
> needs to manage group objects and their association with ioasid and 
> underlying iommu domain thus it's pointless to keep same logic within
> VFIO. Is this understanding correct?
> 
> btw one remaining open question is whether you expect /dev/ioasid to be
> associated with a single iommu domain, or multiple. If only a single
> domain is allowed, the ioasid_fd is equivalent to the scope of a VFIO
> container. It is supposed to have only one gpa_ioasid_id, since one
> iommu domain can only have a single 2nd level pgtable. Then all other
> ioasids, once allocated, must be nested on this gpa_ioasid_id to fit
> in the same domain. If a legacy vIOMMU is exposed (which disallows
> nesting), the userspace has to open an ioasid_fd for every group.
> This is basically the VFIO way. On the other hand, if multiple domains
> are allowed, there could be multiple ioasid_ids, each holding a 2nd level
> pgtable and an iommu domain (or a list of pgtables and domains due to
> the incompatibility issue discussed in another thread), and each can be
> nested on by other ioasids. The application only needs
> to open /dev/ioasid once regardless of whether the vIOMMU allows
> nesting, and has a single interface for ioasid allocation. Which way
> do you prefer?
> 
> Thanks
> Kevin
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-23 22:28                                                                             ` Jason Gunthorpe
@ 2021-04-27  5:15                                                                               ` David Gibson
  0 siblings, 0 replies; 269+ messages in thread
From: David Gibson @ 2021-04-27  5:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

[-- Attachment #1: Type: text/plain, Size: 2238 bytes --]

On Fri, Apr 23, 2021 at 07:28:03PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 23, 2021 at 10:38:51AM -0600, Alex Williamson wrote:
> > On Thu, 22 Apr 2021 20:39:50 -0300
> 
> > > /dev/ioasid should understand the group concept somehow, otherwise it
> > > is incomplete and maybe even security broken.
> > > 
> > > So, how do I add groups to, say, VDPA in a way that makes sense? The
> > > only answer I come to is broadly what I outlined here - make
> > > /dev/ioasid do all the group operations, and do them when we enjoin
> > > the VDPA device to the ioasid.
> > > 
> > > Once I have solved all the groups problems with the non-VFIO users,
> > > then where does that leave VFIO? Why does VFIO need a group FD if
> > > everyone else doesn't?
> > 
> > This assumes there's a solution for vDPA that doesn't just ignore the
> > problem and hope for the best.  I can't speak to a vDPA solution.
> 
> I don't think we can just ignore the question and succeed with
> /dev/ioasid.
> 
> Guess it should get answered as best it can for ioasid "in general"
> then we can decide if it makes sense for VFIO to use the group FD or
> not when working in ioasid mode.
> 
> Maybe a better idea will come up
> 
> > an implicit restriction.  You've listed a step in the description about
> > a "list of devices in the group", but nothing in the pseudo code
> > reflects that step.
> 
> I gave it below with the readdir() - it isn't in the pseudo code
> because the applications I looked through didn't use it, and wouldn't
> benefit from it. I tried to show what things were doing today.

And chances are they will break cryptically if you give them a device
in a multi-device group.  That's not something we want to encourage.
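
As a rough illustration of the check such an application would need, here
is a minimal sketch that counts the entries under the iommu_group/devices
directory mentioned earlier in the thread and flags groups with more than
one device. The helper names count_group_devices() and
group_is_multi_device() are hypothetical, not part of any existing library
or uAPI.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/*
 * Count the entries in a directory such as
 * /sys/bus/pci/devices/<BDF>/iommu_group/devices.
 * Returns the number of entries (excluding "." and ".."), or -1 if the
 * directory cannot be opened.  Hypothetical helper for illustration.
 */
int count_group_devices(const char *devices_dir)
{
	DIR *dir;
	struct dirent *ent;
	int count = 0;

	assert(devices_dir != NULL);

	dir = opendir(devices_dir);
	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, ".") == 0 ||
		    strcmp(ent->d_name, "..") == 0)
			continue;
		count++;
	}
	closedir(dir);
	return count;
}

/* Returns 1 when the device shares its IOMMU group with other devices. */
int group_is_multi_device(const char *devices_dir)
{
	return count_group_devices(devices_dir) > 1;
}
```

A caller would pass the devices directory of the BDF it is about to open
and refuse, or at least warn, when group_is_multi_device() returns 1.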

> 
> > I expect it would be subtly missed by any userspace driver
> > developer unless they happen to work on a system where the grouping
> > is not ideal.
> 
> I'm still unclear - what would be the consequence if the application
> designer misses the group detail?
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27  5:11                                                                             ` David Gibson
@ 2021-04-27 16:39                                                                               ` Jason Gunthorpe
  2021-04-28  0:49                                                                                 ` David Gibson
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-27 16:39 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

On Tue, Apr 27, 2021 at 03:11:25PM +1000, David Gibson wrote:

> > So your proposal sort of moves the entire container/group/domain
> > management into /dev/ioasid and then leaves vfio to provide only the
> > device-specific uAPI. An ioasid represents a page table (address space),
> > and thus is equivalent to the scope of a VFIO container.
> 
> Right.  I don't really know how /dev/ioasid is supposed to work, and
> so far I don't see how it conceptually differs from a container.  What
> is it adding?

There are three motivating topics:
 1) /dev/vfio/vfio is only usable by VFIO and we have many interesting
    use cases now where we need the same thing usable outside VFIO
 2) /dev/vfio/vfio does not support modern stuff like PASID and
    updating to support that is going to be a big change, like adding
    multiple IOASIDs so they can be modeled as a tree inside a
    single FD
 3) I understand there is some desire to revise the uAPI here a bit,
    i.e. Alex mentioned the poor mapping performance.

I would say it is not conceptually different from what VFIO calls a
container, it is just a different uAPI with the goal to be cross
subsystem.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27  5:08                                                                           ` David Gibson
@ 2021-04-27 17:12                                                                             ` Jason Gunthorpe
  2021-04-28  0:58                                                                               ` David Gibson
                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-27 17:12 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > Starting from a BDF the general pseudo code is:
> >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> >  device_fd = open("/dev/vfio/"+device_name)
> >  ioasidfd = open("/dev/ioasid")
> >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> 
> This line is the problem.
> 
> [Historical aside: Alex's early drafts for the VFIO interface looked
> quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> was a bad idea, and groups were developed instead.  I still think it's
> a bad idea, and not just for POWER]

Spawning the VFIO device FD from the group FD is incredibly gross from
a kernel design perspective. Since that was done the struct
vfio_device missed out on a sysfs presence and doesn't have the
typical 'struct device' member or dedicated char device you'd expect a
FD based subsystem to have.

This basically traded normal usage of the driver core for something
that doesn't serve a technical purpose. Given we are now nearly 10 years
later and see that real widely deployed applications are not doing
anything special with the group FD it makes me question the wisdom of
this choice.

> As Alex says, if this line fails because of the group restrictions,
> that's not great because it's not very obvious what's gone wrong.  

Okay, that is fair, but let's solve that problem directly. For
instance netlink has been going in the direction of adding an "extack"
from the kernel which is a descriptive error string. If the failing
ioctl returned the string:

  "cannot join this device to the IOASID because device XXX in the
   same group #10 is in use"

Would you agree it is now obvious what has gone wrong? In fact would
you agree this is a lot better user experience than what applications
do today even though they have the group FD?

> But IMO, the success path on a multi-device group is kind of worse:
> you've now made a meaningful and visible change to the setup of
> devices which are not mentioned in this line *at all*.  

I don't think spawning a single device_fd from the group clearly says
there are repercussions outside that immediate, single, device.

That comes from understanding what the ioctls are doing, and reading
the documentation. The same applies to some non-group FD world.

> Yes, it makes set up more of a pain, but it's necessary complexity to
> actually understand what's going on here.

There is a real technical problem here - the VFIO group is the thing
that spawns the device_fd and that is incompatible with the idea to
centralize the group security logic in drivers/iommu/ and share it
with multiple subsystems.

We also don't have an obvious clean way to incorporate a group FD into
other subsystems (nor would I want to).

One option is VFIO can keep its group FD but nothing else will have
anything like it. However I don't much like the idea that VFIO will
have a special and unique programming model to do the same things
other subsystems will do. That will make it harder for userspace to
implement.

But again, let's see what the draft ioasid proposal looks like and
maybe someone will see a different solution.

Jason

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27  4:50                                                                 ` David Gibson
@ 2021-04-27 17:24                                                                   ` Jason Gunthorpe
  2021-04-28  1:23                                                                     ` David Gibson
  0 siblings, 1 reply; 269+ messages in thread
From: Jason Gunthorpe @ 2021-04-27 17:24 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:

> > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > APIs for allocating multiple tables and other things. I would be
> > > interested to hear someone from IBM talk about what it is doing and
> > > how it doesn't fit into today's IOMMU API.
> 
> Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
> backend represents a single DMA address space which all devices in the
> container will see at all times.  The difference is that there can be
> multiple (well, 2) "windows" of valid IOVAs within that address space.
> Each window can have a different TCE (page table) layout.  For kernel
> drivers, a smallish translated window at IOVA 0 is used for 32-bit
> devices, and a large direct mapped (no page table) window is created
> at a high IOVA for better performance with 64-bit DMA capable devices.
>
> With the VFIO backend we create (but don't populate) a similar
> smallish 32-bit window, userspace can create its own secondary window
> if it likes, though obviously for userspace use there will always be a
> page table.  Userspace can choose the total size (but not address),
> page size and to an extent the page table format of the created
> window.  Note that the TCE page table format is *not* the same as the
> POWER CPU core's page table format.  Userspace can also remove the
> default small window and create its own.

So what do you need from the generic API? I'd suggest if userspace
passes in the required IOVA range it would benefit all the IOMMU
drivers to set up properly sized page tables and PPC could use that to
drive a single window. I notice this is all DPDK did to support TCE.

> The second wrinkle is pre-registration.  That lets userspace register
> certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> allowed to be mapped into the IOMMU.  This is a performance
> optimization, because on pre-registration we also pre-account memory
> that will be effectively locked by DMA mappings, rather than doing it
> at DMA map and unmap time.

This feels like nesting IOASIDs to me, much like a vPASID.

The pre-registered VA range would be the root of the tree and the
vIOMMU created ones would be children of the tree. This could allow
the map operations of the child to refer to already prepped physical
memory held in the root IOASID avoiding the GUP/etc cost.

Seems fairly genericish, though I'm not sure about the kvm linkage..

> I like the idea of a common DMA/IOMMU handling system across
> platforms.  However in order to be efficiently usable for POWER it
> will need to include multiple windows, allowing the user to change
> those windows and something like pre-registration to amortize
> accounting costs for heavy vIOMMU load.

I have a feeling /dev/ioasid is going to end up with some HW specific
escape hatch to create some HW specific IOASID types and operate on
them in a HW specific way.

However, what I would like to see is that something simple like DPDK
can have a single implementation - POWER should implement the standard
operations and map them to something that will work for it.

As an ideal, only things like the HW specific qemu vIOMMU driver
should be reaching for all the special stuff.

In this way the kernel IOMMU driver and the qemu user vIOMMU driver
would form something of a classical split user/kernel driver pattern.

Jason


^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27 16:39                                                                               ` Jason Gunthorpe
@ 2021-04-28  0:49                                                                                 ` David Gibson
  0 siblings, 0 replies; 269+ messages in thread
From: David Gibson @ 2021-04-28  0:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

On Tue, Apr 27, 2021 at 01:39:54PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 03:11:25PM +1000, David Gibson wrote:
> 
> > > So your proposal sort of moves the entire container/group/domain
> > > management into /dev/ioasid and then leaves vfio to provide only the
> > > device-specific uAPI. An ioasid represents a page table (address space),
> > > and thus is equivalent to the scope of a VFIO container.
> > 
> > Right.  I don't really know how /dev/ioasid is supposed to work, and
> > so far I don't see how it conceptually differs from a container.  What
> > is it adding?
> 
> There are three motivating topics:
>  1) /dev/vfio/vfio is only usable by VFIO and we have many interesting
>     use cases now where we need the same thing usable outside VFIO
>  2) /dev/vfio/vfio does not support modern stuff like PASID and
>     updating to support that is going to be a big change, like adding
>     multiple IOASIDs so they can be modeled as a tree inside a
>     single FD
>  3) I understand there is some desire to revise the uAPI here a bit,
>     i.e. Alex mentioned the poor mapping performance.
> 
> I would say it is not conceptually different from what VFIO calls a
> container, it is just a different uAPI with the goal to be cross
> subsystem.

Ok, that makes sense.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27 17:12                                                                             ` Jason Gunthorpe
@ 2021-04-28  0:58                                                                               ` David Gibson
  2021-04-28 14:56                                                                                 ` Jason Gunthorpe
  2021-04-28  6:58                                                                               ` Tian, Kevin
  2021-04-28  7:47                                                                               ` Tian, Kevin
  2 siblings, 1 reply; 269+ messages in thread
From: David Gibson @ 2021-04-28  0:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

[-- Attachment #1: Type: text/plain, Size: 5257 bytes --]

On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > Starting from a BDF the general pseudo code is:
> > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > >  device_fd = open("/dev/vfio/"+device_name)
> > >  ioasidfd = open("/dev/ioasid")
> > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > 
> > This line is the problem.
> > 
> > [Historical aside: Alex's early drafts for the VFIO interface looked
> > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > was a bad idea, and groups were developed instead.  I still think it's
> > a bad idea, and not just for POWER]
> 
> Spawning the VFIO device FD from the group FD is incredibly gross from
> a kernel design perspective. Since that was done the struct
> vfio_device missed out on a sysfs presence and doesn't have the
> typical 'struct device' member or dedicated char device you'd expect a
> FD based subsystem to have.
> 
> This basically traded normal usage of the driver core for something
> that doesn't serve a technical usage. Given we are now nearly 10 years
> later and see that real widely deployed applications are not doing
> anything special with the group FD it makes me question the wisdom of
> this choice.

I'm really not sure what "anything special" would constitute here.

> > As Alex says, if this line fails because of the group restrictions,
> > that's not great because it's not very obvious what's gone wrong.  
> 
> Okay, that is fair, but let's solve that problem directly. For
> instance netlink has been going in the direction of adding an "extack"
> from the kernel which is a descriptive error string. If the failing
> ioctl returned the string:
> 
>   "cannot join this device to the IOASID because device XXX in the
>    same group #10 is in use"

Um.. is there a sane way to return strings from an ioctl()?
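
One conventional pattern, sketched here purely as an assumption and not as
anything proposed in this thread: the ioctl argument struct carries a
caller-supplied buffer pointer and length, and the kernel fills the buffer
on failure. The struct join_ioasid_arg and the function mock_join_ioasid()
are hypothetical, and the kernel side is mocked in userspace with
snprintf() where a real kernel would use copy_to_user().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical ioctl argument: not an existing uAPI. */
struct join_ioasid_arg {
	int32_t  ioasid_fd;	/* FD to join the device to */
	uint32_t errmsg_len;	/* in: size of the buffer at errmsg */
	uint64_t errmsg;	/* in: user pointer to a char buffer */
};

/*
 * Userspace mock of the kernel side of the ioctl.
 * Returns 0 on success, or -EBUSY with a descriptive message written
 * (and truncated if necessary) into the caller's buffer.
 */
int mock_join_ioasid(struct join_ioasid_arg *arg, int group_busy,
		     const char *busy_dev, unsigned int group_id)
{
	char *buf;

	assert(arg != NULL);
	buf = (char *)(uintptr_t)arg->errmsg;

	if (!group_busy)
		return 0;

	/* Fall back to a placeholder when no device name is known. */
	if (!busy_dev || strlen(busy_dev) == 0)
		busy_dev = "(unknown)";

	if (buf && arg->errmsg_len > 0)
		snprintf(buf, arg->errmsg_len,
			 "cannot join this device to the IOASID because "
			 "device %s in the same group #%u is in use",
			 busy_dev, group_id);
	return -EBUSY;
}
```

The errno still tells old callers that the call failed, while callers that
supply a buffer get the human-readable reason as well.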

> Would you agree it is now obvious what has gone wrong? In fact would
> you agree this is a lot better user experience than what applications
> do today even though they have the group FD?
> 
> > But IMO, the success path on a multi-device group is kind of worse:
> > you've now made a meaningful and visible change to the setup of
> > devices which are not mentioned in this line *at all*.  
> 
> I don't think spawning a single device_fd from the group clearly says
> there are repercussions outside that immediate, single, device.

It's not the fact that the device fds are spawned from the group fd.
It's the fact that the "attach" operation - binding the group to the
container now, binding the whatever to the ioasid in future -
explicitly takes a group.  That's an operation that affects a group,
so the interface should reflect that.

Getting the device fds from the group fd kind of follows, because it's
unsafe to do basically anything on the device unless you already
control the group (which in this case means attaching it to a
container/ioasid).  I'm entirely open to ways of doing that that are
less inelegant from a sysfs integration point of view, but the point
is you must manage the group before you can do anything at all with
individual devices.

> That comes from understanding what the ioctls are doing, and reading
> the documentation. The same applies to some non-group FD world.
> 
> > Yes, it makes set up more of a pain, but it's necessary complexity to
> > actually understand what's going on here.
> 
> There is a real technical problem here - the VFIO group is the thing
> that spawns the device_fd and that is incompatible with the idea to
> centralize the group security logic in drivers/iommu/ and share it
> with multiple subsystems.

I don't see why.  I mean, sure, you don't want explicitly the *vfio*
group as such.  But IOMMU group is already a cross-subsystem concept
and you can explicitly expose that in a different way.

> We also don't have an obvious clean way to incorporate a group FD into
> other subsystems (nor would I want to).

If you don't have a group concept in other subsystems, there's a fair
chance they are broken.  There are a bunch of operations that are
inherently per-group.  Well.. per container/IOASID, but the
granularity of membership for that is the group.

> One option is VFIO can keep its group FD but nothing else will have
> anything like it. However I don't much like the idea that VFIO will
> have a special and unique programming model to do the same things
> other subsystems will do. That will make it harder for userspace to
> implement.

Again, I really think this is necessary complexity.  You're right that
far too little of the userspace properly understands group
restrictions.. but these come from real hardware limitations, and I
don't feel like making it *less* obvious in the interface is going to
help that.

> But again, let's see what the draft ioasid proposal looks like and
> maybe someone will see a different solution.
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27 17:24                                                                   ` Jason Gunthorpe
@ 2021-04-28  1:23                                                                     ` David Gibson
  2021-04-29  0:21                                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 269+ messages in thread
From: David Gibson @ 2021-04-28  1:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, Tian, Kevin, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

[-- Attachment #1: Type: text/plain, Size: 5830 bytes --]

On Tue, Apr 27, 2021 at 02:24:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:
> 
> > > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > > APIs for allocating multiple tables and other things. I would be
> > > > interested to hear someone from IBM talk about what it is doing and
> > > > how it doesn't fit into today's IOMMU API.
> > 
> > Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
> > backend represents a single DMA address space which all devices in the
> > container will see at all times.  The difference is that there can be
> > multiple (well, 2) "windows" of valid IOVAs within that address space.
> > Each window can have a different TCE (page table) layout.  For kernel
> > drivers, a smallish translated window at IOVA 0 is used for 32-bit
> > devices, and a large direct mapped (no page table) window is created
> > at a high IOVA for better performance with 64-bit DMA capable devices.
> >
> > With the VFIO backend we create (but don't populate) a similar
> > smallish 32-bit window, userspace can create its own secondary window
> > if it likes, though obviously for userspace use there will always be a
> > page table.  Userspace can choose the total size (but not address),
> > page size and to an extent the page table format of the created
> > window.  Note that the TCE page table format is *not* the same as the
> > POWER CPU core's page table format.  Userspace can also remove the
> > default small window and create its own.
> 
> So what do you need from the generic API? I'd suggest if userspace
> passes in the required IOVA range it would benefit all the IOMMU
> drivers to set up properly sized page tables and PPC could use that to
> drive a single window. I notice this is all DPDK did to support TCE.

Yes.  My proposed model for a unified interface would be that when you
create a new container/IOASID, *no* IOVAs are valid.  Before you can
map anything you would have to create a window with specified base,
size, pagesize (probably some flags for extension, too).  That could
fail if the backend IOMMU can't handle that IOVA range, it could be a
backend no-op if the requested window lies within a fixed IOVA range
the backend supports, or it could actually reprogram the back end for
the new window (such as for POWER TCEs).  Regardless of the hardware,
attempts to map outside the created window(s) would be rejected by
software.

I expect we'd need some kind of query operation to expose limitations
on the number of windows, addresses for them, available pagesizes etc.
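
To make the model concrete, here is a minimal userspace sketch of the
bookkeeping it implies: a fresh address space has no valid IOVAs, windows
are created with an explicit base, size and pagesize, and a map request is
rejected in software unless it falls entirely inside one window. All names
(struct ioas, ioas_create_window(), ioas_map_allowed()) and the two-window
limit are assumptions made for illustration; a real backend would also
decide here whether the hardware can support the requested window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WINDOWS 2	/* assumed limit; a real backend would report this */

struct iova_window {
	uint64_t base;		/* first valid IOVA */
	uint64_t size;		/* length in bytes */
	uint64_t pagesize;	/* IOMMU page size, a power of two */
};

struct ioas {
	struct iova_window win[MAX_WINDOWS];
	size_t nr_windows;	/* zero on creation: no IOVA is valid */
};

/* Create a window; fails on bad geometry, overlap, or no free slot. */
bool ioas_create_window(struct ioas *as, uint64_t base, uint64_t size,
			uint64_t pagesize)
{
	size_t i;

	assert(as != NULL);

	if (as->nr_windows >= MAX_WINDOWS)
		return false;
	if (pagesize == 0 || (pagesize & (pagesize - 1)) != 0)
		return false;
	if (size == 0 || base % pagesize || size % pagesize)
		return false;
	if (base + size < base)	/* reaches or wraps past 2^64 */
		return false;

	for (i = 0; i < as->nr_windows; i++) {
		const struct iova_window *w = &as->win[i];

		if (base < w->base + w->size && w->base < base + size)
			return false;	/* overlaps an existing window */
	}

	as->win[as->nr_windows].base = base;
	as->win[as->nr_windows].size = size;
	as->win[as->nr_windows].pagesize = pagesize;
	as->nr_windows++;
	return true;
}

/* Software check: a mapping must lie entirely inside one window. */
bool ioas_map_allowed(const struct ioas *as, uint64_t iova, uint64_t len)
{
	size_t i;

	assert(as != NULL);

	for (i = 0; i < as->nr_windows; i++) {
		const struct iova_window *w = &as->win[i];

		if (len == 0 || iova % w->pagesize || len % w->pagesize)
			continue;
		if (iova >= w->base && len <= w->size &&
		    iova - w->base <= w->size - len)
			return true;
	}
	return false;
}
```

The containment test is written with subtractions so that iova + len is
never computed and cannot overflow.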

> > The second wrinkle is pre-registration.  That lets userspace register
> > certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> > allowed to be mapped into the IOMMU.  This is a performance
> > optimization, because on pre-registration we also pre-account memory
> > that will be effectively locked by DMA mappings, rather than doing it
> > at DMA map and unmap time.
> 
> This feels like nesting IOASIDs to me, much like a vPASID.
> 
> The pre-registered VA range would be the root of the tree and the
> vIOMMU created ones would be children of the tree. This could allow
> the map operations of the child to refer to already prepped physical
> memory held in the root IOASID avoiding the GUP/etc cost.

Huh... I never thought of it that way, but yeah, that sounds like it
could work.  More elegantly than the current system in fact.
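
A toy model of that arrangement, under the assumption that the root simply
records pre-registered userspace VA ranges and charges them once: child map
requests are then only checked for containment, with no further pinning or
accounting. The names struct root_ioasid, root_preregister() and
child_map_allowed() are hypothetical, and overlap between registrations is
deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PREREG 8	/* arbitrary limit for the sketch */

struct va_range {
	uint64_t va;
	uint64_t len;
};

/* Root of the tree: holds pre-registered userspace VA ranges. */
struct root_ioasid {
	struct va_range reg[MAX_PREREG];
	size_t nr;
	uint64_t accounted_bytes;	/* charged once, at registration */
};

/* Register a VA range; this is where pinning and accounting would occur. */
bool root_preregister(struct root_ioasid *root, uint64_t va, uint64_t len)
{
	assert(root != NULL);

	if (len == 0 || va + len < va || root->nr >= MAX_PREREG)
		return false;

	root->reg[root->nr].va = va;
	root->reg[root->nr].len = len;
	root->nr++;
	root->accounted_bytes += len;
	return true;
}

/*
 * Child map: allowed only when [va, va + len) sits inside one registered
 * range, so this path does no accounting of its own.
 */
bool child_map_allowed(const struct root_ioasid *root, uint64_t va,
		       uint64_t len)
{
	size_t i;

	assert(root != NULL);

	if (len == 0)
		return false;

	for (i = 0; i < root->nr; i++) {
		const struct va_range *r = &root->reg[i];

		if (va >= r->va && len <= r->len &&
		    va - r->va <= r->len - len)
			return true;
	}
	return false;
}
```

The point of the split is that accounted_bytes changes only in
root_preregister(), however many times the child maps and unmaps.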

> Seems fairly genericish, though I'm not sure about the kvm linkage..

I think it should be doable.  We'd basically need to give KVM a handle
on the parent AS, and the child AS, and the guest side handle (what
PAPR calls a "Logical IO Bus Number" - liobn).  KVM would then
translate H_PUT_TCE etc. hypercalls on that liobn into calls into the
IOMMU subsystem to map bits of the parent AS into the child.  We'd
probably have to have some requirements that either parent AS is
identity-mapped to a subset of the userspace AS (effectively what we
have now) or that parent AS is the same as guest physical address.
Not sure which would work better.

> > I like the idea of a common DMA/IOMMU handling system across
> > platforms.  However in order to be efficiently usable for POWER it
> > will need to include multiple windows, allowing the user to change
> > those windows and something like pre-registration to amortize
> > accounting costs for heavy vIOMMU load.
> 
> I have a feeling /dev/ioasid is going to end up with some HW specific
> escape hatch to create some HW specific IOASID types and operate on
> them in a HW specific way.
> 
> However, what I would like to see is that something simple like DPDK
> can have a single implementation - POWER should implement the standard
> operations and map them to something that will work for it.
> 
> As an ideal, only things like the HW specific qemu vIOMMU driver
> should be reaching for all the special stuff.

I'm hoping we can even avoid that, usually.  With the explicitly
created windows model I propose above, that should be possible: qemu will
create the windows according to the IOVA windows the guest platform
expects to see and they either will or won't work on the host platform
IOMMU.  If they do, generic maps/unmaps should be sufficient.  If they
don't, well, the host IOMMU simply cannot emulate the vIOMMU so you're
out of luck anyway.

> In this way the kernel IOMMU driver and the qemu user vIOMMU driver
> would form something of a classical split user/kernel driver pattern.
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 269+ messages in thread

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-26 12:38                                                                         ` Jason Gunthorpe
@ 2021-04-28  6:34                                                                           ` Tian, Kevin
  2021-04-28 15:06                                                                             ` Alex Williamson
                                                                                               ` (2 more replies)
  0 siblings, 3 replies; 269+ messages in thread
From: Tian, Kevin @ 2021-04-28  6:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, April 26, 2021 8:38 PM
> 
[...]
> > Want to hear your opinion for one open here. There is no doubt that
> > an ioasid represents a HW page table when the table is constructed by
> > userspace and then linked to the IOMMU through the bind/unbind
> > API. But I'm not very sure about whether an ioasid should represent
> > the exact pgtable or the mapping metadata when the underlying
> > pgtable is indirectly constructed through map/unmap API. VFIO does
> > the latter way, which is why it allows multiple incompatible domains
> > in a single container which all share the same mapping metadata.
> 
> I think VFIO's map/unmap is way too complex and we know it has bad
> performance problems.

Can you or Alex elaborate on where the complexity and performance
problems lie in VFIO map/unmap? We'd like to understand them in more
detail and see how to avoid them in the new interface.

> 
> If /dev/ioasid is single HW page table only then I would focus on that
> implementation and leave it to userspace to span different
> /dev/ioasids if needed.
> 
> > OK, now I see where the disconnection comes from. In my context ioasid
> > is the identifier that is actually used in the wire, but seems you treat it as
> > a sw-defined namespace purely for representing page tables. We should
> > clear this concept first before further discussing other details. 😊
> 
> There is no general HW requirement that every IO page table be
> referred to by the same PASID and this API would have to support

Yes, but what is the value of allowing multiple PASIDs to refer to the
same I/O page table (except in the nesting pgtable case)? Doesn't it
lead to poor iotlb efficiency, similar to multiple iommu domains
referring to the same page table?

> non-PASID IO page tables as well. So I'd keep the two things
> separated in the uAPI - even though the kernel today has a global
> PASID pool.

For non-PASID usages the allocated PASID will be wasted if we don't
separate ioasid from pasid. But that may be worthwhile given the 1M
available pasids and the simplification in the uAPI, which then only
needs to care about one id space.

> 
> > Then following your proposal, does it mean that we need another
> > interface for allocating PASID? and since ioasid means different
> > thing in uAPI and in-kernel API, possibly a new name is required to
> > avoid confusion?
> 
> I would suggest have two ways to control the PASID
> 
>  1) Over /dev/ioasid allocate a PASID for an IOASID. All future PASID
>     based usages of the IOASID will use that global PASID
> 
>  2) Over the device FD, when the IOASID is bound return the PASID that
>     was selected. If the IOASID does not have a global PASID then the
>     kernel is free to make something up. In this mode a single IOASID
>     can have multiple PASIDs.
> 
> Simple things like DPDK can use #2 and potentially have better PASID
> limits. hypervisors will most likely have to use #1, but it depends on
> how their vIOMMU interface works.

Can you elaborate on why DPDK would want to use #2, i.e. not use a
global PASID?
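
For reference, my reading of the two modes quoted above, as a rough
sketch. All names here (sketch_ioasid, sketch_dev, the counter-based
allocators) are made up for illustration and are not a proposed uAPI
or an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NO_PASID UINT32_MAX

/* Hypothetical IOASID object: optionally carries one global PASID. */
struct sketch_ioasid {
    uint32_t global_pasid; /* SKETCH_NO_PASID until mode #1 is used */
};

/* Hypothetical device: owns a private PASID counter for mode #2. */
struct sketch_dev {
    uint32_t next_local_pasid;
};

/* Mode #1: allocate (once) a global PASID for the IOASID. */
uint32_t sketch_alloc_global_pasid(struct sketch_ioasid *ioas,
                                   uint32_t *global_pool_next)
{
    assert(ioas != NULL && global_pool_next != NULL);
    if (ioas->global_pasid == SKETCH_NO_PASID)
        ioas->global_pasid = (*global_pool_next)++;
    return ioas->global_pasid;
}

/*
 * Mode #2: bind over the device and return the PASID that was selected.
 * With a global PASID every device reports the same value; without one
 * each device picks its own, so one IOASID can map to several PASIDs.
 */
uint32_t sketch_bind(struct sketch_dev *dev, const struct sketch_ioasid *ioas)
{
    assert(dev != NULL && ioas != NULL);
    if (ioas->global_pasid != SKETCH_NO_PASID)
        return ioas->global_pasid;
    return dev->next_local_pasid++;
}
```

In this sketch, binding the same IOASID to two devices before any
global allocation yields two unrelated per-device PASIDs; after
sketch_alloc_global_pasid() both binds return the one global value.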

> 
> I think the name IOASID is fine for the uAPI, the kernel version can
> be called ioasid_id or something.

ioasid is already an ID, so ioasid_id just adds confusion. Another
point is that ioasid is currently used to represent both PCI PASID and
ARM substream ID in the kernel. This implies that if we want to separate
ioasid and pasid in the uAPI, 'pasid' also needs to be replaced with
another general term usable for substream ID. Are we making the
terms too confusing here?

> 
> (also looking at ioasid.c, why do we need such a thin and odd wrapper
> around xarray?)
> 

I'll leave it to Jean and Jacob.

Thanks
Kevin

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27 17:12                                                                             ` Jason Gunthorpe
  2021-04-28  0:58                                                                               ` David Gibson
@ 2021-04-28  6:58                                                                               ` Tian, Kevin
  2021-05-04 17:12                                                                                 ` Jason Gunthorpe
  2021-04-28  7:47                                                                               ` Tian, Kevin
  2 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-28  6:58 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 28, 2021 1:12 AM
> 
[...] 
> > As Alex says, if this line fails because of the group restrictions,
> > that's not great because it's not very obvious what's gone wrong.
> 
> Okay, that is fair, but let's solve that problem directly. For
> instance netlink has been going in the direction of adding a "extack"
> from the kernel which is a descriptive error string. If the failing
> ioctl returned the string:
> 
>   "cannot join this device to the IOASID because device XXX in the
>    same group #10 is in use"
> 
> Would you agree it is now obvious what has gone wrong? In fact would
> you agree this is a lot better user experience than what applications
> do today even though they have the group FD?
> 

Currently all the discussions are around implicit vs. explicit uAPI
semantics for the group restriction. However, if we look beyond groups,
implicit semantics might be inevitable when dealing with incompatible
iommu domains. An existing example of iommu incompatibility is
IOMMU_CACHE. In the future there could be other incompatibilities, such
as whether nested translation is supported. In the end userspace has to
do some due diligence on understanding the iommu topology and attributes
to decide how many VFIO containers or ioasid fds should be created. That
does push some burden to userspace, but it's difficult to define a
group-like kernel object to enforce such a restriction for iommu
compatibility. Then the best the kernel can do is to return an
informational error message when an incompatible device is attached to
an existing domain. If this is the way we expect to move forward
anyway, I feel that removing the explicit group FD from the uAPI
doesn't make userspace worse...
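
As a rough illustration of such an informational error, here is a
minimal sketch in the spirit of the extack idea. The structs, the
sketch_attach() helper and the compatibility rule (the device must
support whatever the domain requires) are assumptions made up for this
example, not existing kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical attributes, mirroring the two examples in the text. */
struct sketch_attrs {
    bool enforce_coherency; /* stands in for IOMMU_CACHE support */
    bool nesting;           /* stands in for nested translation */
};

struct sketch_domain {
    struct sketch_attrs attrs; /* what the existing domain requires */
};

struct sketch_device {
    const char *name;
    struct sketch_attrs attrs; /* what the device's IOMMU supports */
};

/*
 * Attach dev to dom.  Returns 0 on success, or -EINVAL with a
 * descriptive string in msg when the device is incompatible.
 */
int sketch_attach(const struct sketch_domain *dom,
                  const struct sketch_device *dev,
                  char *msg, size_t len)
{
    assert(dom != NULL && dev != NULL && msg != NULL && len > 0);
    msg[0] = '\0';

    if (dom->attrs.enforce_coherency && !dev->attrs.enforce_coherency) {
        snprintf(msg, len,
                 "cannot attach %s: its IOMMU cannot enforce the cache "
                 "coherency required by this domain", dev->name);
        return -EINVAL;
    }
    if (dom->attrs.nesting && !dev->attrs.nesting) {
        snprintf(msg, len,
                 "cannot attach %s: its IOMMU does not support the "
                 "nested translation required by this domain", dev->name);
        return -EINVAL;
    }
    return 0;
}
```

With something along these lines, userspace that guessed the topology
wrong gets told which attribute is at fault and can create another
container or ioasid fd for that device.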

Thanks
Kevin

* RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
  2021-04-27 17:12                                                                             ` Jason Gunthorpe
  2021-04-28  0:58                                                                               ` David Gibson
  2021-04-28  6:58                                                                               ` Tian, Kevin
@ 2021-04-28  7:47                                                                               ` Tian, Kevin
  2021-04-28 18:41                                                                                 ` Jason Gunthorpe
  2 siblings, 1 reply; 269+ messages in thread
From: Tian, Kevin @ 2021-04-28  7:47 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: Alex Williamson, Liu, Yi L, Jacob Pan, Auger Eric,
	Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu,
	David Woodhouse, iommu, cgroups, Tejun Heo, Li Zefan,
	Johannes Weiner, Jean-Philippe Brucker, Jonathan Corbet, Raj,
	Ashok, Wu, Hao, Jiang, Dave, Alexey Kardashevskiy