archive mirror
 help / color / mirror / Atom feed
From: Fenghua Yu <>
To: "Thomas Gleixner" <>,
	"Borislav Petkov" <>,
	"Ingo Molnar" <>, "H Peter Anvin" <>,
	"Andy Lutomirski" <>,
	"Jean-Philippe Brucker" <>,
	"Christoph Hellwig" <>,
	"Peter Zijlstra" <>,
	"David Woodhouse" <>,
	"Lu Baolu" <>,
	"Dave Hansen" <>,
	"Tony Luck" <>,
	"Randy Dunlap" <>,
	"Ashok Raj" <>,
	"Jacob Jun Pan" <>,
	"Dave Jiang" <>,
	"Sohil Mehta" <>,
	"Ravi V Shankar" <>
Cc: Fenghua Yu <>,, x86 <>,
	linux-kernel <>
Subject: [PATCH v8 3/9] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)
Date: Tue, 15 Sep 2020 09:30:07 -0700	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

From: Ashok Raj <>

ENQCMD and Data Streaming Accelerator (DSA) and all of their associated
features are a complicated stack with lots of interconnected pieces.
This documentation provides a big picture overview for all of the

Signed-off-by: Ashok Raj <>
Co-developed-by: Fenghua Yu <>
Signed-off-by: Fenghua Yu <>
Reviewed-by: Tony Luck <>
- Address all of comments from Boris and Randy.

- Change the doc for updating PASID by IPI and context switch (Andy).

- Replace deprecated intel_svm_bind_mm() by iommu_sva_bind_mm() (Baolu)
- Fix a couple of typos (Baolu)

- Fix the doc format and add the doc in toctree (Thomas)
- Modify the doc for better description (Thomas, Tony, Dave)

 Documentation/x86/index.rst |   1 +
 Documentation/x86/sva.rst   | 256 ++++++++++++++++++++++++++++++++++++
 2 files changed, 257 insertions(+)
 create mode 100644 Documentation/x86/sva.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 265d9e9a093b..e5d5ff096685 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -30,3 +30,4 @@ x86-specific Documentation
+   sva
diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst
new file mode 100644
index 000000000000..a1f008ef7dad
--- /dev/null
+++ b/Documentation/x86/sva.rst
@@ -0,0 +1,256 @@
+.. SPDX-License-Identifier: GPL-2.0
+Shared Virtual Addressing (SVA) with ENQCMD
+Shared Virtual Addressing (SVA) allows the processor and device to use the
+same virtual addresses avoiding the need for software to translate virtual
+addresses to physical addresses. SVA is what PCIe calls Shared Virtual
+Memory (SVM).
+In addition to the convenience of using application virtual addresses
+by the device, it also doesn't require pinning pages for DMA.
+PCIe Address Translation Services (ATS) along with Page Request Interface
+(PRI) allow devices to function much the same way as the CPU handling
+application page-faults. For more information please refer to the PCIe
+specification Chapter 10: ATS Specification.
+Use of SVA requires IOMMU support in the platform. IOMMU also is required
+to support PCIe features ATS and PRI. ATS allows devices to cache
+translations for virtual addresses. The IOMMU driver uses the mmu_notifier()
+support to keep the device TLB cache and the CPU cache in sync. PRI allows
+the device to request paging the virtual address by using the CPU page tables
+before accessing the address.
+Shared Hardware Workqueues
+Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
+the use of Shared Work Queues (SWQ) by both applications and Virtual
+Machines (VM's). This allows better hardware utilization vs. hard
+partitioning resources that could result in under utilization. In order to
+allow the hardware to distinguish the context for which work is being
+executed in the hardware by SWQ interface, SIOV uses Process Address Space
+ID (PASID), which is a 20-bit number defined by the PCIe SIG.
+PASID value is encoded in all transactions from the device. This allows the
+IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
+Resource Identifier (RID) which is the Bus/Device/Function.
+ENQCMD is a new instruction on Intel platforms that atomically submits a
+work descriptor to a device. The descriptor includes the operation to be
+performed, virtual addresses of all parameters, virtual address of a completion
+record, and the PASID (process address space ID) of the current process.
+ENQCMD works with non-posted semantics and carries a status back if the
+command was accepted by hardware. This allows the submitter to know if the
+submission needs to be retried or other device specific mechanisms to
+implement fairness or ensure forward progress should be provided.
+ENQCMD is the glue that ensures applications can directly submit commands
+to the hardware and also permits hardware to be aware of application context
+to perform I/O operations via use of PASID.
+Process Address Space Tagging
+A new thread-scoped MSR (IA32_PASID) provides the connection between
+user processes and the rest of the hardware. When an application first
+accesses an SVA-capable device this MSR is initialized with a newly
+allocated PASID. The driver for the device calls an IOMMU-specific API
+that sets up the routing for DMA and page-requests.
+For example, the Intel Data Streaming Accelerator (DSA) uses
+iommu_sva_bind_device(), which will do the following:
+- Allocate the PASID, and program the process page-table (%cr3 register) in the
+  PASID context entries.
+- Register for mmu_notifier() to track any page-table invalidations to keep
+  the device TLB in sync. For example, when a page-table entry is invalidated,
+  IOMMU propagates the invalidation to the device TLB. This will force any
+  future access by the device to this virtual address to participate in
+  ATS. If the IOMMU responds with proper response that a page is not
+  present, the device would request the page to be paged in via the PCIe PRI
+  protocol before performing I/O.
+This MSR is managed with the XSAVE feature set as "supervisor state" to
+ensure the MSR is updated during context switch.
+PASID Management
+The kernel must allocate a PASID on behalf of each process which will use
+ENQCMD and program it into the new MSR to communicate the process identity to
+platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
+from this process.  When a user submits a work descriptor to a device using the
+ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
+value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
+with the same PASID. The platform IOMMU uses the PASID in the transaction to
+perform address translation. The IOMMU APIs setup the corresponding PASID
+entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in
+The MSR must be configured on each logical CPU before any application
+thread can interact with a device. Threads that belong to the same
+process share the same page tables, thus the same MSR value.
+PASID is cleared when a process is created. The PASID allocation and MSR
+programming may occur long after a process and its threads have been created.
+One thread must call iommu_sva_bind_device() to allocate the PASID for the
+process. If a thread uses ENQCMD without the MSR first being populated, a #GP
+will be raised. The kernel will update the PASID MSR with the PASID for all
+threads in the process. A single process PASID can be used simultaneously
+with multiple devices since they all share the same address space.
+One thread can call iommu_sva_unbind_device() to free the allocated PASID.
+The kernel will clear the PASID MSR for all threads belonging to the process.
+New threads inherit the MSR value from the parent.
+ * Each process has many threads, but only one PASID.
+ * Devices have a limited number (~10's to 1000's) of hardware workqueues.
+   The device driver manages allocating hardware workqueues.
+ * A single mmap() maps a single hardware workqueue as a "portal" and
+   each portal maps down to a single workqueue.
+ * For each device with which a process interacts, there must be
+   one or more mmap()'d portals.
+ * Many threads within a process can share a single portal to access
+   a single device.
+ * Multiple processes can separately mmap() the same portal, in
+   which case they still share one device hardware workqueue.
+ * The single process-wide PASID is used by all threads to interact
+   with all devices.  There is not, for instance, a PASID for each
+   thread or each thread<->device pair.
+* What is SVA/SVM?
+Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
+work in the same address space, i.e., to share it. Some call it Shared
+Virtual Memory (SVM), but Linux community wanted to avoid confusing it with
+POSIX Shared Memory and Secure Virtual Machines which were terms already in
+* What is a PASID?
+A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
+(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
+PASID is included in all transactions between the platform and the device.
+* How are shared workqueues different?
+Traditionally, in order for userspace applications to interact with hardware,
+there is a separate hardware instance required per process. For example,
+consider doorbells as a mechanism of informing hardware about work to process.
+Each doorbell is required to be spaced 4k (or page-size) apart for process
+isolation. This requires hardware to provision that space and reserve it in
+MMIO. This doesn't scale as the number of threads becomes quite large. The
+hardware also manages the queue depth for Shared Work Queues (SWQ), and
+consumers don't need to track queue depth. If there is no space to accept
+a command, the device will return an error indicating retry.
+A user should check Deferrable Memory Write (DMWr) capability on the device
+and only submits ENQCMD when the device supports it. In the new DMWr PCIe
+terminology, devices need to support DMWr completer capability. In addition,
+it requires all switch ports to support DMWr routing and must be enabled by
+the PCIe subsystem, much like how PCIe atomic operations are managed for
+SWQ allows hardware to provision just a single address in the device. When
+used with ENQCMD to submit work, the device can distinguish the process
+submitting the work since it will include the PASID assigned to that
+process. This helps the device scale to a large number of processes.
+* Is this the same as a user space device driver?
+Communicating with the device via the shared workqueue is much simpler
+than a full blown user space driver. The kernel driver does all the
+initialization of the hardware. User space only needs to worry about
+submitting work and processing completions.
+* Is this the same as SR-IOV?
+Single Root I/O Virtualization (SR-IOV) focuses on providing independent
+hardware interfaces for virtualizing hardware. Hence, it's required to be
+almost fully functional interface to software supporting the traditional
+BARs, space for interrupts via MSI-X, its own register layout.
+Virtual Functions (VFs) are assisted by the Physical Function (PF)
+Scalable I/O Virtualization builds on the PASID concept to create device
+instances for virtualization. SIOV requires host software to assist in
+creating virtual devices; each virtual device is represented by a PASID
+along with the bus/device/function of the device.  This allows device
+hardware to optimize device resource creation and can grow dynamically on
+demand. SR-IOV creation and management is very static in nature. Consult
+references below for more details.
+* Why not just create a virtual function for each app?
+Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
+duplicated hardware for PCI config space and interrupts such as MSI-X.
+Resources such as interrupts have to be hard partitioned between VFs at
+creation time, and cannot scale dynamically on demand. The VFs are not
+completely independent from the Physical Function (PF). Most VFs require
+some communication and assistance from the PF driver. SIOV, in contrast,
+creates a software-defined device where all the configuration and control
+aspects are mediated via the slow path. The work submission and completion
+happen without any mediation.
+* Does this support virtualization?
+ENQCMD can be used from within a guest VM. In these cases, the VMM helps
+with setting up a translation table to translate from Guest PASID to Host
+PASID. Please consult the ENQCMD instruction set reference for more
+* Does memory need to be pinned?
+When devices support SVA along with platform hardware such as IOMMU
+supporting such devices, there is no need to pin memory for DMA purposes.
+Devices that support SVA also support other PCIe features that remove the
+pinning requirement for memory.
+Device TLB support - Device requests the IOMMU to lookup an address before
+use via Address Translation Service (ATS) requests.  If the mapping exists
+but there is no page allocated by the OS, IOMMU hardware returns that no
+mapping exists.
+Device requests the virtual address to be mapped via Page Request
+Interface (PRI). Once the OS has successfully completed the mapping, it
+returns the response back to the device. The device requests again for
+a translation and continues.
+IOMMU works with the OS in managing consistency of page-tables with the
+device. When removing pages, it interacts with the device to remove any
+device TLB entry that might have been cached before removing the mappings from
+the OS.
+DSA spec:

iommu mailing list

  parent reply	other threads:[~2020-09-15 16:30 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-15 16:30 [PATCH v8 0/9] x86: tag application address space for devices Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 1/9] drm, iommu: Change type of pasid to u32 Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 2/9] iommu/vt-d: Change flags type to unsigned int in binding mm Fenghua Yu
2020-09-15 16:30 ` Fenghua Yu [this message]
2020-09-17  7:53   ` [PATCH v8 3/9] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing) Borislav Petkov
2020-09-17 14:56     ` Raj, Ashok
2020-09-17 17:18       ` Borislav Petkov
2020-09-17 17:22         ` Raj, Ashok
2020-09-17 17:30           ` Borislav Petkov
2020-09-18 16:22             ` Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 4/9] x86/cpufeatures: Enumerate ENQCMD and ENQCMDS instructions Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 5/9] x86/fpu/xstate: Add supervisor PASID state for ENQCMD feature Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 6/9] x86/msr-index: Define IA32_PASID MSR Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 7/9] mm: Define pasid in mm Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 8/9] x86/cpufeatures: Mark ENQCMD as disabled when configured out Fenghua Yu
2020-09-15 16:30 ` [PATCH v8 9/9] x86/mmu: Allocate/free PASID Fenghua Yu
2021-05-29  9:17   ` [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid() Thomas Gleixner
2021-05-31  8:43     ` Borislav Petkov
2021-05-31 10:16       ` Thomas Gleixner
2021-06-02 20:37         ` Luck, Tony
2021-06-03 17:31           ` Andy Lutomirski
2021-06-09 17:32             ` Luck, Tony
2021-06-09 23:34               ` Andy Lutomirski
2021-06-25 15:46                 ` Luck, Tony
2021-06-02 10:14     ` Borislav Petkov
2021-06-02 10:20       ` Thomas Gleixner
2021-06-03 11:20       ` Vinod Koul
2021-06-03 11:42         ` Borislav Petkov
2021-06-03 12:47           ` Vinod Koul
2021-06-03 14:33             ` Borislav Petkov
2020-09-16  8:06 ` [PATCH v8 0/9] x86: tag application address space for devices Joerg Roedel
2020-09-17 23:53   ` Fenghua Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).