All of lore.kernel.org
 help / color / mirror / Atom feed
* [virtio-comment] [PATCH v3] Add virtio-iommu device specification
@ 2019-04-30 13:56 Jean-Philippe Brucker
  2019-04-30 14:42 ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
  2019-05-03 17:05 ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 2 replies; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-30 13:56 UTC (permalink / raw)
  To: virtio-comment, virtio-dev
  Cc: joro, tnowicki, eric.auger, kevin.tian, lorenzo.pieralisi,
	bharat.bhushan, mst, bauerman

The IOMMU device allows a guest to manage DMA mappings for physical,
emulated and paravirtualized endpoints. Add device description for the
virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
UNMAP requests, as well as translation error reporting.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
Since v2 I rebased onto virtio v1.1-wd02, fixing a conflict in
conformance.tex and using the new \conformance command.

A PDF version is available at
https://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.11-draft.pdf
---
 conformance.tex  |  40 ++-
 content.tex      |   1 +
 virtio-iommu.tex | 850 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 889 insertions(+), 2 deletions(-)
 create mode 100644 virtio-iommu.tex

diff --git a/conformance.tex b/conformance.tex
index 42f702a..79a3e7d 100644
--- a/conformance.tex
+++ b/conformance.tex
@@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
   \begin{itemize}
     \item Clause \ref{sec:Conformance / Driver Conformance}.
     \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver Conformance}.
-    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance}.
+    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}.
     \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
   \end{itemize}
 \item[Device] A device MUST conform to four conformance clauses:
   \begin{itemize}
     \item Clause \ref{sec:Conformance / Device Conformance}.
     \item One of clauses \ref{sec:Conformance / Device Conformance / PCI Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device Conformance}.
-    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device Conformance}.
+    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance}, \ref{sec:Conformance / Device Conformance / Socket Device Conformance} or \ref{sec:Conformance / Device Conformance / IOMMU Device Conformance}.
     \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
   \end{itemize}
 \end{description}
@@ -183,6 +183,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
 \item \ref{drivernormative:Device Types / Socket Device / Device Operation / Device Events}
 \end{itemize}
 
+\conformance{\subsection}{IOMMU Driver Conformance}\label{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}
+
+An IOMMU driver MUST conform to the following normative statements:
+
+\begin{itemize}
+\item \ref{drivernormative:Device Types / IOMMU Device / Feature bits}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device configuration layout}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device Initialization}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / ATTACH request}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / DETACH request}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / MAP request}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / UNMAP request}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE request}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
+\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / Fault reporting}
+\end{itemize}
+
 \conformance{\section}{Device Conformance}\label{sec:Conformance / Device Conformance}
 
 A device MUST conform to the following normative statements:
@@ -336,6 +354,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
 \item \ref{devicenormative:Device Types / Socket Device / Device Operation / Receive and Transmit}
 \end{itemize}
 
+\conformance{\subsection}{IOMMU Device Conformance}\label{sec:Conformance / Device Conformance / IOMMU Device Conformance}
+
+An IOMMU device MUST conform to the following normative statements:
+
+\begin{itemize}
+\item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / MAP request}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / UNMAP request}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE request}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
+\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / Fault reporting}
+\end{itemize}
+
 \conformance{\section}{Legacy Interface: Transitional Device and Transitional Driver Conformance}\label{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}
 A conformant implementation MUST be either transitional or
 non-transitional, see \ref{intro:Legacy
diff --git a/content.tex b/content.tex
index 193b6e1..5449a46 100644
--- a/content.tex
+++ b/content.tex
@@ -5594,6 +5594,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
 \input{virtio-input.tex}
 \input{virtio-crypto.tex}
 \input{virtio-vsock.tex}
+\input{virtio-iommu.tex}
 
 \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 
diff --git a/virtio-iommu.tex b/virtio-iommu.tex
new file mode 100644
index 0000000..be3dcd3
--- /dev/null
+++ b/virtio-iommu.tex
@@ -0,0 +1,850 @@
+\section{IOMMU device}\label{sec:Device Types / IOMMU Device}
+
+The virtio-iommu device manages Direct Memory Access (DMA) from one or
+more endpoints. It may act both as a proxy for physical IOMMUs managing
+devices assigned to the guest, and as virtual IOMMU managing emulated and
+paravirtualized devices.
+
+The driver first discovers endpoints managed by the virtio-iommu device
+using standard firmware mechanisms. It then sends requests to create
+virtual address spaces and virtual-to-physical mappings for these
+endpoints. In its simplest form, the virtio-iommu supports four request
+types:
+
+\begin{enumerate}
+\item Create a domain and attach an endpoint to it.  \\
+  \texttt{attach(endpoint = 0x8, domain = 1)}
+\item Create a mapping between a range of guest-virtual and guest-physical
+  address. \\
+  \texttt{map(domain = 1, virt_start = 0x1000, virt_end = 0x1fff,
+          phys = 0xa000, flags = READ)}
+
+  Endpoint 0x8, for example a hardware PCI endpoint with BDF 00:01.0, can
+  now read at addresses 0x1000-0x1fff. These accesses are translated
+  into system-physical addresses by the IOMMU.
+
+\item Remove the mapping.\\
+  \texttt{unmap(domain = 1, virt_start = 0x1000, virt_end = 0x1fff)}
+
+  Any access to addresses 0x1000-0x1fff by endpoint 0x8 would now be
+  rejected.
+\item Detach the device and remove the domain.\\
+  \texttt{detach(endpoint = 0x8, domain = 1)}
+\end{enumerate}
+
+\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}
+
+23
+
+\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}
+
+\begin{description}
+\item[0] requestq
+\item[1] eventq
+\end{description}
+
+\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}
+
+\begin{description}
+\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
+  Available range of virtual addresses is described in \field{input_range}
+
+\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
+  The number of domains supported is described in \field{domain_bits}
+
+\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
+  Map and unmap requests are available.\footnote{Future extensions may add
+  different modes of operations. At the moment, only
+  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
+
+\item[VIRTIO_IOMMU_F_BYPASS (3)]
+  When not attached to a domain, endpoints downstream of the IOMMU
+  can access the guest-physical address space.
+
+\item[VIRTIO_IOMMU_F_PROBE (4)]
+  The PROBE request is available.
+\end{description}
+
+\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
+
+The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
+VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
+VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.
+
+\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
+
+If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
+VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_PROBE or
+VIRTIO_IOMMU_F_MAP_UNMAP feature bits, and if the driver did not accept
+this feature bit, then the device MAY signal failure by failing to set
+FEATURES_OK \field{device status} bit when the driver writes it.
+
+\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}
+
+The \field{page_size_mask} field is always present. Availability of the
+others depend on various feature bits as indicated above.
+
+\begin{lstlisting}
+struct virtio_iommu_config {
+  u64 page_size_mask;
+  struct virtio_iommu_range {
+    u64 start;
+    u64 end;
+  } input_range;
+  u8  domain_bits;
+  u8  padding[3];
+  u32 probe_size;
+};
+\end{lstlisting}
+
+\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
+
+The driver MUST NOT write to device configuration fields.
+
+\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
+
+The device SHOULD set \field{padding} to zero.
+
+The device MUST set at least one bit in \field{page_size_mask}, describing
+the page granularity. The device MAY set more than one bit in
+\field{page_size_mask}.
+
+\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}
+
+When the device is reset, endpoints are not attached to any domain.
+If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
+access guest-physical addresses ("bypass mode"). If the feature is not
+negotiated, then any memory access from endpoints will fault. Upon
+attaching an endpoint in bypass mode to a new domain, any memory access
+from the endpoint will fault, since the domain does not contain any
+mapping.
+
+The driver chooses operating mode depending on its capabilities. In this
+version of the virtio-iommu device, the only supported mode is
+VIRTIO_IOMMU_F_MAP_UNMAP.
+
+\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
+
+The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
+of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.
+
+If the VIRTIO_IOMMU_F_PROBE feature is negotiated, the driver SHOULD send a
+VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
+endpoint to a domain.
+
+\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
+
+If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
+device SHOULD NOT let endpoints access the guest-physical address space.
+
+\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}
+
+Driver send requests on the request virtqueue, notifies the device and
+waits for the device to return the request with a status in the used ring.
+All requests are split in two parts: one device-readable, one device-
+writable.
+
+\begin{lstlisting}
+struct virtio_iommu_req_head {
+  u8   type;
+  u8   reserved[3];
+};
+
+struct virtio_iommu_req_tail {
+  u8   status;
+  u8   reserved[3];
+};
+\end{lstlisting}
+
+Type may be one of:
+
+\begin{lstlisting}
+#define VIRTIO_IOMMU_T_ATTACH     1
+#define VIRTIO_IOMMU_T_DETACH     2
+#define VIRTIO_IOMMU_T_MAP        3
+#define VIRTIO_IOMMU_T_UNMAP      4
+#define VIRTIO_IOMMU_T_PROBE      5
+\end{lstlisting}
+
+A few general-purpose status codes are defined here. Unless explicitly
+described in a \textbf{Requirements} section, these values are hints to
+make troubleshooting easier.
+
+When the device fails to parse a request, for instance if a request seems
+too small for its type and the device cannot find the tail, then it will
+be unable to set \field{status}. In that case, it should return the
+buffers without writing in them.
+
+\begin{lstlisting}
+/* All good! Carry on. */
+#define VIRTIO_IOMMU_S_OK         0
+/* Virtio communication error */
+#define VIRTIO_IOMMU_S_IOERR      1
+/* Unsupported request */
+#define VIRTIO_IOMMU_S_UNSUPP     2
+/* Internal device error */
+#define VIRTIO_IOMMU_S_DEVERR     3
+/* Invalid parameters */
+#define VIRTIO_IOMMU_S_INVAL      4
+/* Out-of-range parameters */
+#define VIRTIO_IOMMU_S_RANGE      5
+/* Entry not found */
+#define VIRTIO_IOMMU_S_NOENT      6
+/* Bad address */
+#define VIRTIO_IOMMU_S_FAULT      7
+\end{lstlisting}
+
+Range limits of some request fields are described in the device
+configuration:
+
+\begin{itemize}
+\item \field{page_size_mask} contains the bitmask of all page sizes that
+  can be mapped. The least significant bit set defines the page
+  granularity of IOMMU mappings. Other bits in the mask are hints
+  describing page sizes that the IOMMU can merge into a single mapping
+  (page blocks).
+
+  The smallest page granularity supported by the IOMMU is one byte. It is
+  legal for the driver to map one byte at a time if bit 0 of
+  \field{page_size_mask} is set.
+
+\item If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is offered,
+  \field{domain_bits} contains the number of bits supported in a domain
+  ID, the identifier used in most requests. A value of 0 is valid, it
+  means that a single domain is supported and endpoints can only be
+  attached to domain 0.
+
+  If the feature is not offered, domain identifiers can use up to 32 bits.
+
+\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
+  \field{input_range} contains the virtual address range that the IOMMU is
+  able to translate. Any mapping request to virtual addresses outside of
+  this range will fail.
+
+  If the feature is not offered, virtual mappings span over the whole
+  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
+\end{itemize}
+
+\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
+
+The driver SHOULD set field \field{reserved} of
+\verb+struct virtio_iommu_req_head+ to zero.
+
+When a device returns a complete request in the used queue without having
+written to it, the driver SHOULD interpret it as a failure from the device
+to parse the request.
+
+If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, the driver SHOULD
+NOT send requests with \field{virt_start} less than
+\field{input_range.start} or \field{virt_end} greater than
+\field{input_range.end}.
+
+If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is negotiated, the driver SHOULD
+NOT send requests with \field{domain} greater than the size described by
+\field{domain_bits}.
+
+The driver SHOULD NOT use multiple descriptor chains for a single request.
+
+\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
+
+The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
+didn't succeed.
+
+If a request \field{type} is not recognized, the device SHOULD return the
+buffers on the used ring and set the \field{len} field of the used element
+to zero.
+
+The device SHOULD ignore field \field{reserved} of
+\verb+struct virtio_iommu_req_head+ and SHOULD set field \field{reserved}
+of \verb+struct virtio_iommu_req_tail+ to zero.
+
+If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated and the range
+described by fields \field{virt_start} and \field{virt_end} doesn't fit in
+the range described by \field{input_range}, the device MAY set
+\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
+
+If the VIRTIO_IOMMU_F_DOMAIN_BITS is negotiated and bits above
+\field{domain_bits} are set in field \field{domain}, the device MAY set
+\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
+
+\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
+
+\begin{lstlisting}
+struct virtio_iommu_req_attach {
+  struct virtio_iommu_req_head head;
+  le32 domain;
+  le32 endpoint;
+  u8   reserved[8];
+  struct virtio_iommu_req_tail tail;
+};
+\end{lstlisting}
+
+Attach an endpoint to a domain. \field{domain} is an identifier unique to
+the virtio-iommu device. The \field{domain} number doesn't have a meaning
+outside of virtio-iommu. If the domain doesn't exist in the device, it is
+created. \field{endpoint} is an identifier unique to the virtio-iommu
+device. The host communicates these unique endpoint IDs to the guest using
+methods outside the scope of this specification, but the following rules
+apply:
+
+\begin{itemize}
+\item The endpoint ID is unique from the virtio-iommu point of view.
+  Multiple endpoints whose DMA transactions are not translated by the same
+  virtio-iommu may have the same endpoint ID. Endpoints whose DMA
+  transactions may be translated by the same virtio-iommu must have
+  different endpoint IDs.
+
+\item Sometimes the host cannot completely isolate two endpoints from each
+  others. For example on a legacy PCI bus, endpoints can snoop DMA
+  transactions from their neighbours. In this case, the host must
+  communicate to the guest that it cannot isolate these endpoints from
+  each others, or that the physical IOMMU cannot distinguish transactions
+  coming from these endpoints. The method used to communicate this is
+  outside the scope of this specification.
+\end{itemize}
+
+Multiple endpoints may be attached to the same domain. An endpoint cannot
+be attached to multiple domains at the same time.
+
+\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
+
+The driver SHOULD set \field{reserved} to zero.
+
+The driver SHOULD ensure that endpoints that cannot be isolated by the
+host are attached to the same domain.
+
+\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
+
+If the \field{reserved} field of an ATTACH request is not zero, the device
+SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
+NOT attach the endpoint to the domain. \footnote{The device should
+validate input of ATTACH requests in case the driver attempts to attach in
+a mode that is unimplemented by the device, and would be incompatible with
+the modes implemented by the device.}
+
+If the endpoint identified by \field{endpoint} doesn't exist, then the
+device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
+
+If another endpoint is already attached to the domain identified by
+\field{domain}, then the device MAY attach the endpoint identified by
+\field{endpoint} to the domain. If it cannot do so, the device
+MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
+
+If the endpoint identified by \field{endpoint} is already attached to
+another domain, then the device SHOULD first detach it from that domain
+and attach it to the one identified by \field{domain}. In that case the
+device behaves as if the driver issued a DETACH request with this
+\field{endpoint}, followed by the ATTACH request. If the device cannot do
+so, it MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
+
+If properties of the endpoint (obtained with a PROBE request) are
+incompatible with properties of other endpoints already attached to the
+requested domain, the device MAY attach the endpoint. If it cannot do so, the
+device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
+\footnote{In general it is simpler and safer to reject attach when two devices
+have differing values in a property, for example two reserved regions of
+different types that would overlap. Depending on the property, device
+implementation can try to merge them and accept the attach.}
+
+\subsubsection{DETACH request}
+
+\begin{lstlisting}
+struct virtio_iommu_req_detach {
+  struct virtio_iommu_req_head head;
+  le32 domain;
+  le32 endpoint;
+  u8   reserved[8];
+  struct virtio_iommu_req_tail tail;
+};
+\end{lstlisting}
+
+Detach an endpoint from a domain. When this request completes, the
+endpoint cannot access any mapping from that domain anymore. If feature
+VIRTIO_IOMMU_F_BYPASS has been negotiated, then the endpoint accesses the
+guest-physical address space once this request completes.
+
+After all endpoints have been successfully detached from a domain, it
+ceases to exist and its ID can be reused by the driver for another domain.
+
+\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
+
+The driver SHOULD set \field{reserved} to zero.
+
+\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
+
+If the \field{reserved} field of a DETACH request is not zero, the device
+MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
+the device MAY still perform the DETACH operation.
+
+If the endpoint identified by \field{endpoint} doesn't exist, then the
+device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
+
+If the domain identified by \field{domain} doesn't exist, or if the
+endpoint identified by \field{endpoint} isn't attached to this domain,
+then the device MAY set the request \field{status} to
+VIRTIO_IOMMU_S_INVAL.
+
+The device MUST ensure that after being detached from a domain, the
+endpoint cannot access any mapping from that domain.
+
+\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}
+
+\begin{lstlisting}
+struct virtio_iommu_req_map {
+  struct virtio_iommu_req_head head;
+  le32  domain;
+  le64  virt_start;
+  le64  virt_end;
+  le64  phys_start;
+  le32  flags;
+  struct virtio_iommu_req_tail tail;
+};
+
+/* Flags are: */
+#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
+#define VIRTIO_IOMMU_MAP_F_MMIO   (1 << 3)
+\end{lstlisting}
+
+Map a range of virtually-contiguous addresses to a range of
+physically-contiguous addresses of the same size. After the request
+succeeds, all endpoints attached to this domain can access memory in the
+range $[virt\_start; virt\_end]$ (inclusive). For example, if an endpoint
+accesses address $VA \in [virt\_start; virt\_end]$, the device (or the
+physical IOMMU) translates the address: $PA = VA - virt\_start +
+phys\_start$. If the access parameters are compatible with \field{flags}
+(for instance, the access is write and \field{flags} are
+VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
+the access to reach $PA$.
+
+The range defined by \field{virt_start} and \field{virt_end} should be
+within the limits specified by \field{input_range}. Given $phys\_end =
+phys\_start + virt\_end - virt\_start$, the range defined by
+\field{phys_start} and phys_end should be within the guest-physical
+address space. This includes upper and lower limits, as well as any
+carving of guest-physical addresses for use by the host. Guest physical
+boundaries are set by the host using a firmware mechanism outside the
+scope of this specification.
+
+Availability and allowed combinations of \field{flags} depend on the
+underlying IOMMU architectures. VIRTIO_IOMMU_MAP_F_READ and
+VIRTIO_IOMMU_MAP_F_WRITE are usually implemented, although READ is
+sometimes implied by WRITE. VIRTIO_IOMMU_MAP_F_EXEC might not be
+available. In addition combinations such as "WRITE and not READ" or "WRITE
+and EXEC" might not be supported.
+
+The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a protection
+flag. It may be used, for example, to map Message Signaled Interrupt
+doorbells when a VIRTIO_IOMMU_RESV_MEM_T_MSI region isn't available. To
+trigger interrupts the endpoint performs a direct memory write to another
+peripheral, the IRQ chip. Since it is a signal, the write must not be
+buffered, elided, or combined with other writes by the memory
+interconnect. The precise meaning of the MMIO flag depends on the
+underlying memory architecture (for example on Armv8-A it corresponds to
+the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
+isn't required to support the MMIO flag.
+
+This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
+negotiated.
+
+\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
+
+The driver SHOULD set undefined \field{flags} bits to zero.
+
+\field{virt_end} MUST be strictly greater than \field{virt_start}.
+
+The driver SHOULD set the VIRTIO_IOMMU_MAP_F_MMIO flag when the physical
+range corresponds to memory-mapped device registers. The physical range
+SHOULD have a single memory type: either normal memory or memory-mapped
+I/O.
+
+\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
+
+If \field{virt_start}, \field{phys_start} or (\field{virt_end} + 1) is
+not aligned on the page granularity, the device SHOULD set the request
+\field{status} to VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.
+
+If a mapping already exists in the requested range, the device SHOULD set
+the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD NOT change
+any mapping.
+
+If the device doesn't recognize a \field{flags} bit, it SHOULD set the
+request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
+SHOULD NOT create the mapping. \footnote{Validating the input is important
+here, because the driver might be attempting to map with special flags
+that the device doesn't recognize. Creating the mapping with incompatible
+flags may result in loss of coherency and security hazards.}
+
+If a flag or combination of flag isn't supported, the device MAY set the
+request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
+
+The device MUST NOT allow writes to a range mapped without the
+VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
+does not support write-only mappings, the device MAY allow reads to a
+range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
+VIRTIO_IOMMU_MAP_F_READ.
+
+If \field{domain} does not exist, the device SHOULD set the request
+\field{status} to VIRTIO_IOMMU_S_NOENT.
+
+\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}
+
+\begin{lstlisting}
+struct virtio_iommu_req_unmap {
+  struct virtio_iommu_req_head head;
+  le32  domain;
+  le64  virt_start;
+  le64  virt_end;
+  u8    reserved[4];
+  struct virtio_iommu_req_tail tail;
+};
+\end{lstlisting}
+
+Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
+a mapping as a virtual region created with a single MAP request. All
+mappings covered by the range $[virt\_start; virt\_end]$ (inclusive) are
+removed.
+
+The semantics of unmapping are specified in \ref{drivernormative:Device
+Types / IOMMU Device / Device operations / UNMAP request} and
+\ref{devicenormative:Device Types / IOMMU Device / Device operations /
+UNMAP request}, and illustrated with the following requests, assuming each
+example sequence starts with a blank address space. We define two
+pseudocode functions \texttt{map(virt_start, virt_end) -> mapping} and
+\texttt{unmap(virt_start, virt_end)}.
+
+\begin{lstlisting}
+(1) unmap(virt_start=0,
+          virt_end=4)            -> succeeds, doesn't unmap anything
+
+(2) a = map(virt_start=0,
+            virt_end=9);
+    unmap(0, 9)                  -> succeeds, unmaps a
+
+(3) a = map(0, 4);
+    b = map(5, 9);
+    unmap(0, 9)                  -> succeeds, unmaps a and b
+
+(4) a = map(0, 9);
+    unmap(0, 4)                  -> faults, doesn't unmap anything
+
+(5) a = map(0, 4);
+    b = map(5, 9);
+    unmap(0, 4)                  -> succeeds, unmaps a
+
+(6) a = map(0, 4);
+    unmap(0, 9)                  -> succeeds, unmaps a
+
+(7) a = map(0, 4);
+    b = map(10, 14);
+    unmap(0, 14)                 -> succeeds, unmaps a and b
+\end{lstlisting}
+
+This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
+negotiated.
+
+\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
+
+The driver SHOULD set the \field{reserved} field to zero.
+
+The range, defined by \field{virt_start} and \field{virt_end}, SHOULD
+cover one or more contiguous mappings created with MAP requests. The range
+MAY spill over unmapped virtual addresses.
+
+The first address of a range SHOULD either be the first address of a
+mapping or be outside any mapping. The last address of a range SHOULD
+either be the last address of a mapping or be outside any mapping.
+
+\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
+
+If the \field{reserved} field of an UNMAP request is not zero, the device
+MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
+the device MAY perform the UNMAP operation.
+
+If \field{domain} does not exist, the device SHOULD set the request
+\field{status} to VIRTIO_IOMMU_S_NOENT.
+
+If a mapping affected by the range is not covered in its entirety by the
+range (the UNMAP request would split the mapping), then the device SHOULD
+set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
+remove any mapping.
+
+If part of the range or the full range is not covered by an existing
+mapping, then the device SHOULD remove all mappings affected by the range
+and set the request \field{status} to VIRTIO_IOMMU_S_OK.
+
+\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}
+
+If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
+VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
+device manages. This probe is performed before attaching the endpoint to
+a domain.
+
+\begin{lstlisting}
+struct virtio_iommu_req_probe {
+  struct virtio_iommu_req_head head;
+  /* Device-readable */
+  le32  endpoint;
+  u8    reserved[64];
+
+  /* Device-writable */
+  u8    properties[probe_size];
+  struct virtio_iommu_req_tail tail;
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{endpoint}] has the same meaning as in ATTACH and DETACH
+  requests.
+
+\item[\field{reserved}] is used as padding, so that future extensions can
+  add fields to the device-readable part.
+
+\item[\field{properties}] contains a list of properties of the
+  \field{endpoint}, filled by the device. The length of the
+  \field{properties} field is \field{probe_size} bytes. Each property is
+  described with a \verb+struct virtio_iommu_probe_property+ header, which
+  may be followed by a value of size \field{length}.
+
+\begin{lstlisting}
+#define VIRTIO_IOMMU_PROBE_T_MASK 0xfff
+
+struct virtio_iommu_probe_property {
+  le16  type;
+  le16  length;
+};
+\end{lstlisting}
+
+\end{description}
+
+The driver allocates a buffer of adequate size for the probe request,
+writes \field{endpoint} and adds the buffer to the request queue. The
+device fills the \field{properties} field with a list of properties for
+this endpoint.
+
+The driver parses the first property by reading \field{type}, then
+\field{length}. If the driver recognizes \field{type}, it reads and
+handles the rest of the property. The driver then reads the next property,
+that is located $(\field{length} + 4)$ bytes after the beginning of the
+first one, and so on. The driver parses all properties until it reaches a
+NONE property or the end of \field{properties}.
+
+The upper nibble of \field{type} is reserved for future extensions.
+Therefore only 4096 types are available. The actual type of a property is
+extracted like this:
+
+\begin{lstlisting}
+u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_T_MASK;
+\end{lstlisting}
+
+Available property types are described in section
+\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
+
+\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
+
+The size of \field{properties} MUST be \field{probe_size} bytes.
+
+The driver SHOULD set \field{reserved} to zero.
+
+If the driver doesn't recognize the \field{type} of a property, it SHOULD
+ignore the property and continue parsing the list.
+
+The driver SHOULD NOT deduce the property length from \field{type}.
+
+The driver SHOULD ignore bits[15:12] of \field{type}.
+
+\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
+
+If the \field{reserved} field of a PROBE request is not zero, the device
+MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.
+
+If the endpoint identified by \field{endpoint} doesn't exist, then the
+device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
+
+If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
+driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
+the buffers on the used ring and set the \field{len} field of the used
+element to zero.
+
+The device SHOULD set bits [15:12] of property \field{type} to zero.
+
+The device MUST write the size of the property without the
+\verb+struct virtio_iommu_probe_property+ header, in bytes, into
+\field{length}.
+
+When two properties follow each others, the device MUST put the second
+property exactly $(\field{length} + 4)$ bytes after the beginning of the
+first one.
+
+If the \field{properties} list is smaller than \field{probe_size}, then
+the device SHOULD NOT write any property and SHOULD set the request
+\field{status} to VIRTIO_IOMMU_S_INVAL.
+
+If the device doesn't fill all \field{probe_size} bytes with properties,
+it SHOULD fill the remaining bytes of \field{properties} with zeroes.
+
+\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}
+
+\begin{lstlisting}
+#define VIRTIO_IOMMU_PROBE_T_NONE       0
+#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
+\end{lstlisting}
+
+\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}
+
+Marks the end of the property list. This property doesn't have any value,
+and should have \field{length} 0.
+
+\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
+
+The RESV_MEM property describes a chunk of reserved virtual memory. It may
+be used by the device to describe virtual address ranges that shouldn't be
+allocated by the driver, or that are special.
+
+\begin{lstlisting}
+struct virtio_iommu_probe_resv_mem {
+  struct virtio_iommu_probe_property head;
+  u8    subtype;
+  u8    reserved[3];
+  le64  start;
+  le64  end;
+};
+\end{lstlisting}
+
+Fields \field{start} and \field{end} describe the range of reserved virtual
+addresses. \field{subtype} may be one of:
+
+\begin{description}
+  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
+    Accesses to virtual addresses in this region have undefined behavior.
+    They may be aborted by the device, bypass it, or never even reach it.
+    The region may also be used for host mappings, for example Message
+    Signaled Interrupts.
+
+    The guest should neither use these virtual addresses in a MAP request
+    nor instruct endpoints to perform DMA on them.
+
+  \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
+    This region is a doorbell for Message Signaled Interrupts (MSIs). It
+    is similar to VIRTIO_IOMMU_RESV_MEM_T_RESERVED, in that the driver
+    should not map virtual addresses described by the property.
+
+    In addition it tells the guest how to handle MSI doorbells. If the
+    endpoint doesn't have a VIRTIO_IOMMU_RESV_MEM_T_MSI property
+    corresponding to the doorbell of a virtual MSI controller, then the
+    guest should create a mapping for it.
+\end{description}
+
+\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
+
+The driver SHOULD NOT map any virtual address described by a
+VIRTIO_IOMMU_RESV_MEM_T_RESERVED or VIRTIO_IOMMU_RESV_MEM_T_MSI property.
+
+The driver SHOULD ignore \field{reserved}.
+
+The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
+was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.
+
+\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
+
+The device SHOULD set \field{reserved} to zero.
+
+The device SHOULD NOT present more than one VIRTIO_IOMMU_RESV_MEM_T_MSI
+property per endpoint.
+
+The device SHOULD NOT present RESV_MEM properties that overlap each others
+for the same endpoint.
+
+\subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting}
+
+The device can report translation faults and other significant asynchronous
+events on the event virtqueue. The driver initially populates the queue with
+empty report buffers. When the device needs to report an event, it fills a
+buffer and notifies the driver with an interrupt. The driver consumes the
+report and moves the buffer back onto the queue.
+
+If no buffer is available, the device may either wait for one to be consumed,
+or drop the event.
+
+\begin{lstlisting}
+struct virtio_iommu_fault {
+  u8    reason;
+  u8    reserved[3];
+  le32  flags;
+  le32  endpoint;
+  le32  reserved1;
+  le64  address;
+};
+
+#define VIRTIO_IOMMU_FAULT_F_READ     (1 << 0)
+#define VIRTIO_IOMMU_FAULT_F_WRITE    (1 << 1)
+#define VIRTIO_IOMMU_FAULT_F_EXEC     (1 << 2)
+#define VIRTIO_IOMMU_FAULT_F_ADDRESS  (1 << 8)
+\end{lstlisting}
+
+\begin{description}
+  \item[\field{reason}] The reason for this report. It may have the
+    following values:
+    \begin{description}
+      \item[VIRTIO_IOMMU_FAULT_R_UNKNOWN (0)] An internal error happened, or
+        an error that cannot be described with the following reasons.
+      \item[VIRTIO_IOMMU_FAULT_R_DOMAIN (1)] The endpoint attempted to
+        access \field{address} without being attached to a domain.
+      \item[VIRTIO_IOMMU_FAULT_R_MAPPING (2)] The endpoint attempted to
+        access \field{address}, which wasn't mapped in the domain or
+        didn't have the correct protection flags.
+    \end{description}
+  \item[\field{flags}] Information about the fault context.
+  \item[\field{endpoint}] The endpoint causing the fault.
+  \item[\field{reserved} and \field{reserved1}] Should be zero.
+  \item[\field{address}] If VIRTIO_IOMMU_FAULT_F_ADDRESS is set, the
+    address causing the fault.
+\end{description}
+
+These faults are not recoverable\footnote{This means that the PRI
+extension to PCI, for example, that allows recoverable faults, isn't
+supported for the moment.}. The guest has to do its best to
+prevent any future fault from happening, by stopping or resetting the
+endpoint.
+
+When the fault is reported by a physical IOMMU, the fault reasons may not
+match exactly the reason of the original fault report. The device should
+try its best to find the closest match.
+
+If the device encounters a fault that wasn't caused by a specific
+endpoint, it is unlikely that the driver would be able to do anything else
+than print the fault and stop using the device, so reporting the fault on
+the event queue isn't useful. In that case, we recommend using the
+DEVICE_NEEDS_RESET status bit.
+
+\drivernormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
+
+If the \field{reserved} field is not zero, the driver SHOULD ignore the
+fault report.\footnote{A future format may implement events that are not
+faults, which would be differentiated by a type field in place of
+\field{reserved}.}
+
+The driver SHOULD ignore undefined \field{flags}.
+
+If the driver doesn't recognize \field{reason}, it SHOULD treat the fault
+as if it was VIRTIO_IOMMU_FAULT_R_UNKNOWN.
+
+\devicenormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
+
+The device SHOULD set \field{reserved} and \field{reserved1} to zero.
+
+The device SHOULD set undefined \field{flags} to zero.
+
+The device SHOULD write a valid endpoint ID in \field{endpoint}.
+
+The device MAY omit setting VIRTIO_IOMMU_FAULT_F_ADDRESS and writing
+\field{address} in any fault report, regardless of the \field{reason}.
+
+If a buffer is too small to contain the fault report\footnotemark, the
+device SHOULD NOT use multiple buffers to describe it. The device MAY fall
+back to using an older fault report format that fits in the buffer.
+
+\footnotetext{This would happen for example if the device implements a
+more recent version of this specification, whose fault report contains
+additional fields.}
-- 
2.21.0



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification
  2019-04-30 13:56 [virtio-comment] [PATCH v3] Add virtio-iommu device specification Jean-Philippe Brucker
@ 2019-04-30 14:42 ` Jean-Philippe Brucker
  2019-05-03 19:47   ` Michael S. Tsirkin
  2019-05-03 17:05 ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 1 reply; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-30 14:42 UTC (permalink / raw)
  To: virtio-comment, virtio-dev
  Cc: joro, tnowicki, eric.auger, kevin.tian, lorenzo.pieralisi,
	bharat.bhushan, mst, bauerman

On 30/04/2019 14:56, Jean-Philippe Brucker wrote:
> The IOMMU device allows a guest to manage DMA mappings for physical,
> emulated and paravirtualized endpoints. Add device description for the
> virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
> UNMAP requests, as well as translation error reporting.
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

I'd like to request a vote for this version (the github issue is now up
to date and the patch applies cleanly)

Thanks,
Jean

> ---
> Since v2 I rebased onto virtio v1.1-wd02, fixing a conflict in
> conformance.tex and using the new \conformance command.
> 
> A PDF version is available at
> https://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.11-draft.pdf
> ---
>  conformance.tex  |  40 ++-
>  content.tex      |   1 +
>  virtio-iommu.tex | 850 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 889 insertions(+), 2 deletions(-)
>  create mode 100644 virtio-iommu.tex
> 
> diff --git a/conformance.tex b/conformance.tex
> index 42f702a..79a3e7d 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>    \begin{itemize}
>      \item Clause \ref{sec:Conformance / Driver Conformance}.
>      \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver Conformance}.
> -    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance}.
> +    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}.
>      \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
>    \end{itemize}
>  \item[Device] A device MUST conform to four conformance clauses:
>    \begin{itemize}
>      \item Clause \ref{sec:Conformance / Device Conformance}.
>      \item One of clauses \ref{sec:Conformance / Device Conformance / PCI Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device Conformance}.
> -    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device Conformance}.
> +    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance}, \ref{sec:Conformance / Device Conformance / Socket Device Conformance} or \ref{sec:Conformance / Device Conformance / IOMMU Device Conformance}.
>      \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
>    \end{itemize}
>  \end{description}
> @@ -183,6 +183,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \item \ref{drivernormative:Device Types / Socket Device / Device Operation / Device Events}
>  \end{itemize}
>  
> +\conformance{\subsection}{IOMMU Driver Conformance}\label{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}
> +
> +An IOMMU driver MUST conform to the following normative statements:
> +
> +\begin{itemize}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Feature bits}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device configuration layout}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device Initialization}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / ATTACH request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / DETACH request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / MAP request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / UNMAP request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / Fault reporting}
> +\end{itemize}
> +
>  \conformance{\section}{Device Conformance}\label{sec:Conformance / Device Conformance}
>  
>  A device MUST conform to the following normative statements:
> @@ -336,6 +354,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \item \ref{devicenormative:Device Types / Socket Device / Device Operation / Receive and Transmit}
>  \end{itemize}
>  
> +\conformance{\subsection}{IOMMU Device Conformance}\label{sec:Conformance / Device Conformance / IOMMU Device Conformance}
> +
> +An IOMMU device MUST conform to the following normative statements:
> +
> +\begin{itemize}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / MAP request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / UNMAP request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / Fault reporting}
> +\end{itemize}
> +
>  \conformance{\section}{Legacy Interface: Transitional Device and Transitional Driver Conformance}\label{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}
>  A conformant implementation MUST be either transitional or
>  non-transitional, see \ref{intro:Legacy
> diff --git a/content.tex b/content.tex
> index 193b6e1..5449a46 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5594,6 +5594,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
>  \input{virtio-input.tex}
>  \input{virtio-crypto.tex}
>  \input{virtio-vsock.tex}
> +\input{virtio-iommu.tex}
>  
>  \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>  
> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> new file mode 100644
> index 0000000..be3dcd3
> --- /dev/null
> +++ b/virtio-iommu.tex
> @@ -0,0 +1,850 @@
> +\section{IOMMU device}\label{sec:Device Types / IOMMU Device}
> +
> +The virtio-iommu device manages Direct Memory Access (DMA) from one or
> +more endpoints. It may act both as a proxy for physical IOMMUs managing
> +devices assigned to the guest, and as virtual IOMMU managing emulated and
> +paravirtualized devices.
> +
> +The driver first discovers endpoints managed by the virtio-iommu device
> +using standard firmware mechanisms. It then sends requests to create
> +virtual address spaces and virtual-to-physical mappings for these
> +endpoints. In its simplest form, the virtio-iommu supports four request
> +types:
> +
> +\begin{enumerate}
> +\item Create a domain and attach an endpoint to it.  \\
> +  \texttt{attach(endpoint = 0x8, domain = 1)}
> +\item Create a mapping between a range of guest-virtual and guest-physical
> +  address. \\
> +  \texttt{map(domain = 1, virt_start = 0x1000, virt_end = 0x1fff,
> +          phys = 0xa000, flags = READ)}
> +
> +  Endpoint 0x8, for example a hardware PCI endpoint with BDF 00:01.0, can
> +  now read at addresses 0x1000-0x1fff. These accesses are translated
> +  into system-physical addresses by the IOMMU.
> +
> +\item Remove the mapping.\\
> +  \texttt{unmap(domain = 1, virt_start = 0x1000, virt_end = 0x1fff)}
> +
> +  Any access to addresses 0x1000-0x1fff by endpoint 0x8 would now be
> +  rejected.
> +\item Detach the device and remove the domain.\\
> +  \texttt{detach(endpoint = 0x8, domain = 1)}
> +\end{enumerate}
> +
> +\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}
> +
> +23
> +
> +\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}
> +
> +\begin{description}
> +\item[0] requestq
> +\item[1] eventq
> +\end{description}
> +
> +\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}
> +
> +\begin{description}
> +\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
> +  Available range of virtual addresses is described in \field{input_range}
> +
> +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
> +  The number of domains supported is described in \field{domain_bits}
> +
> +\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
> +  Map and unmap requests are available.\footnote{Future extensions may add
> +  different modes of operations. At the moment, only
> +  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
> +
> +\item[VIRTIO_IOMMU_F_BYPASS (3)]
> +  When not attached to a domain, endpoints downstream of the IOMMU
> +  can access the guest-physical address space.
> +
> +\item[VIRTIO_IOMMU_F_PROBE (4)]
> +  The PROBE request is available.
> +\end{description}
> +
> +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> +
> +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
> +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.
> +
> +\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> +
> +If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_PROBE or
> +VIRTIO_IOMMU_F_MAP_UNMAP feature bits, and if the driver did not accept
> +this feature bit, then the device MAY signal failure by failing to set
> +FEATURES_OK \field{device status} bit when the driver writes it.
> +
> +\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}
> +
> +The \field{page_size_mask} field is always present. Availability of the
> +others depend on various feature bits as indicated above.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_config {
> +  u64 page_size_mask;
> +  struct virtio_iommu_range {
> +    u64 start;
> +    u64 end;
> +  } input_range;
> +  u8  domain_bits;
> +  u8  padding[3];
> +  u32 probe_size;
> +};
> +\end{lstlisting}
> +
> +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
> +
> +The driver MUST NOT write to device configuration fields.
> +
> +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
> +
> +The device SHOULD set \field{padding} to zero.
> +
> +The device MUST set at least one bit in \field{page_size_mask}, describing
> +the page granularity. The device MAY set more than one bit in
> +\field{page_size_mask}.
> +
> +\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}
> +
> +When the device is reset, endpoints are not attached to any domain.
> +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
> +access guest-physical addresses ("bypass mode"). If the feature is not
> +negotiated, then any memory access from endpoints will fault. Upon
> +attaching an endpoint in bypass mode to a new domain, any memory access
> +from the endpoint will fault, since the domain does not contain any
> +mapping.
> +
> +The driver chooses operating mode depending on its capabilities. In this
> +version of the virtio-iommu device, the only supported mode is
> +VIRTIO_IOMMU_F_MAP_UNMAP.
> +
> +\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
> +
> +The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
> +of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.
> +
> +If the VIRTIO_IOMMU_F_PROBE feature is negotiated, the driver SHOULD send a
> +VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
> +endpoint to a domain.
> +
> +\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
> +
> +If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
> +device SHOULD NOT let endpoints access the guest-physical address space.
> +
> +\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}
> +
> +Driver send requests on the request virtqueue, notifies the device and
> +waits for the device to return the request with a status in the used ring.
> +All requests are split in two parts: one device-readable, one device-
> +writable.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_head {
> +  u8   type;
> +  u8   reserved[3];
> +};
> +
> +struct virtio_iommu_req_tail {
> +  u8   status;
> +  u8   reserved[3];
> +};
> +\end{lstlisting}
> +
> +Type may be one of:
> +
> +\begin{lstlisting}
> +#define VIRTIO_IOMMU_T_ATTACH     1
> +#define VIRTIO_IOMMU_T_DETACH     2
> +#define VIRTIO_IOMMU_T_MAP        3
> +#define VIRTIO_IOMMU_T_UNMAP      4
> +#define VIRTIO_IOMMU_T_PROBE      5
> +\end{lstlisting}
> +
> +A few general-purpose status codes are defined here. Unless explicitly
> +described in a \textbf{Requirements} section, these values are hints to
> +make troubleshooting easier.
> +
> +When the device fails to parse a request, for instance if a request seems
> +too small for its type and the device cannot find the tail, then it will
> +be unable to set \field{status}. In that case, it should return the
> +buffers without writing in them.
> +
> +\begin{lstlisting}
> +/* All good! Carry on. */
> +#define VIRTIO_IOMMU_S_OK         0
> +/* Virtio communication error */
> +#define VIRTIO_IOMMU_S_IOERR      1
> +/* Unsupported request */
> +#define VIRTIO_IOMMU_S_UNSUPP     2
> +/* Internal device error */
> +#define VIRTIO_IOMMU_S_DEVERR     3
> +/* Invalid parameters */
> +#define VIRTIO_IOMMU_S_INVAL      4
> +/* Out-of-range parameters */
> +#define VIRTIO_IOMMU_S_RANGE      5
> +/* Entry not found */
> +#define VIRTIO_IOMMU_S_NOENT      6
> +/* Bad address */
> +#define VIRTIO_IOMMU_S_FAULT      7
> +\end{lstlisting}
> +
> +Range limits of some request fields are described in the device
> +configuration:
> +
> +\begin{itemize}
> +\item \field{page_size_mask} contains the bitmask of all page sizes that
> +  can be mapped. The least significant bit set defines the page
> +  granularity of IOMMU mappings. Other bits in the mask are hints
> +  describing page sizes that the IOMMU can merge into a single mapping
> +  (page blocks).
> +
> +  The smallest page granularity supported by the IOMMU is one byte. It is
> +  legal for the driver to map one byte at a time if bit 0 of
> +  \field{page_size_mask} is set.
> +
> +\item If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is offered,
> +  \field{domain_bits} contains the number of bits supported in a domain
> +  ID, the identifier used in most requests. A value of 0 is valid, it
> +  means that a single domain is supported and endpoints can only be
> +  attached to domain 0.
> +
> +  If the feature is not offered, domain identifiers can use up to 32 bits.
> +
> +\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
> +  \field{input_range} contains the virtual address range that the IOMMU is
> +  able to translate. Any mapping request to virtual addresses outside of
> +  this range will fail.
> +
> +  If the feature is not offered, virtual mappings span over the whole
> +  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
> +\end{itemize}
> +
> +\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
> +
> +The driver SHOULD set field \field{reserved} of
> +\verb+struct virtio_iommu_req_head+ to zero.
> +
> +When a device returns a complete request in the used queue without having
> +written to it, the driver SHOULD interpret it as a failure from the device
> +to parse the request.
> +
> +If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, the driver SHOULD
> +NOT send requests with \field{virt_start} less than
> +\field{input_range.start} or \field{virt_end} greater than
> +\field{input_range.end}.
> +
> +If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is negotiated, the driver SHOULD
> +NOT send requests with \field{domain} greater than the size described by
> +\field{domain_bits}.
> +
> +The driver SHOULD NOT use multiple descriptor chains for a single request.
> +
> +\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
> +
> +The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
> +didn't succeed.
> +
> +If a request \field{type} is not recognized, the device SHOULD return the
> +buffers on the used ring and set the \field{len} field of the used element
> +to zero.
> +
> +The device SHOULD ignore field \field{reserved} of
> +\verb+struct virtio_iommu_req_head+ and SHOULD set field \field{reserved}
> +of \verb+struct virtio_iommu_req_tail+ to zero.
> +
> +If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated and the range
> +described by fields \field{virt_start} and \field{virt_end} doesn't fit in
> +the range described by \field{input_range}, the device MAY set
> +\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
> +
> +If the VIRTIO_IOMMU_F_DOMAIN_BITS is negotiated and bits above
> +\field{domain_bits} are set in field \field{domain}, the device MAY set
> +\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
> +
> +\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_attach {
> +  struct virtio_iommu_req_head head;
> +  le32 domain;
> +  le32 endpoint;
> +  u8   reserved[8];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +Attach an endpoint to a domain. \field{domain} is an identifier unique to
> +the virtio-iommu device. The \field{domain} number doesn't have a meaning
> +outside of virtio-iommu. If the domain doesn't exist in the device, it is
> +created. \field{endpoint} is an identifier unique to the virtio-iommu
> +device. The host communicates these unique endpoint IDs to the guest using
> +methods outside the scope of this specification, but the following rules
> +apply:
> +
> +\begin{itemize}
> +\item The endpoint ID is unique from the virtio-iommu point of view.
> +  Multiple endpoints whose DMA transactions are not translated by the same
> +  virtio-iommu may have the same endpoint ID. Endpoints whose DMA
> +  transactions may be translated by the same virtio-iommu must have
> +  different endpoint IDs.
> +
> +\item Sometimes the host cannot completely isolate two endpoints from each
> +  others. For example on a legacy PCI bus, endpoints can snoop DMA
> +  transactions from their neighbours. In this case, the host must
> +  communicate to the guest that it cannot isolate these endpoints from
> +  each others, or that the physical IOMMU cannot distinguish transactions
> +  coming from these endpoints. The method used to communicate this is
> +  outside the scope of this specification.
> +\end{itemize}
> +
> +Multiple endpoints may be attached to the same domain. An endpoint cannot
> +be attached to multiple domains at the same time.
> +
> +\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> +
> +The driver SHOULD set \field{reserved} to zero.
> +
> +The driver SHOULD ensure that endpoints that cannot be isolated by the
> +host are attached to the same domain.
> +
> +\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> +
> +If the \field{reserved} field of an ATTACH request is not zero, the device
> +SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
> +NOT attach the endpoint to the domain. \footnote{The device should
> +validate input of ATTACH requests in case the driver attempts to attach in
> +a mode that is unimplemented by the device, and would be incompatible with
> +the modes implemented by the device.}
> +
> +If the endpoint identified by \field{endpoint} doesn't exist, then the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If another endpoint is already attached to the domain identified by
> +\field{domain}, then the device MAY attach the endpoint identified by
> +\field{endpoint} to the domain. If it cannot do so, the device
> +MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +
> +If the endpoint identified by \field{endpoint} is already attached to
> +another domain, then the device SHOULD first detach it from that domain
> +and attach it to the one identified by \field{domain}. In that case the
> +device behaves as if the driver issued a DETACH request with this
> +\field{endpoint}, followed by the ATTACH request. If the device cannot do
> +so, it MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +
> +If properties of the endpoint (obtained with a PROBE request) are
> +incompatible with properties of other endpoints already attached to the
> +requested domain, the device MAY attach the endpoint. If it cannot do so, the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +\footnote{In general it is simpler and safer to reject attach when two devices
> +have differing values in a property, for example two reserved regions of
> +different types that would overlap. Depending on the property, device
> +implementation can try to merge them and accept the attach.}
> +
> +\subsubsection{DETACH request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_detach {
> +  struct virtio_iommu_req_head head;
> +  le32 domain;
> +  le32 endpoint;
> +  u8   reserved[8];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +Detach an endpoint from a domain. When this request completes, the
> +endpoint cannot access any mapping from that domain anymore. If feature
> +VIRTIO_IOMMU_F_BYPASS has been negotiated, then the endpoint accesses the
> +guest-physical address space once this request completes.
> +
> +After all endpoints have been successfully detached from a domain, it
> +ceases to exist and its ID can be reused by the driver for another domain.
> +
> +\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
> +
> +The driver SHOULD set \field{reserved} to zero.
> +
> +\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
> +
> +If the \field{reserved} field of a DETACH request is not zero, the device
> +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
> +the device MAY still perform the DETACH operation.
> +
> +If the endpoint identified by \field{endpoint} doesn't exist, then the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If the domain identified by \field{domain} doesn't exist, or if the
> +endpoint identified by \field{endpoint} isn't attached to this domain,
> +then the device MAY set the request \field{status} to
> +VIRTIO_IOMMU_S_INVAL.
> +
> +The device MUST ensure that after being detached from a domain, the
> +endpoint cannot access any mapping from that domain.
> +
> +\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_map {
> +  struct virtio_iommu_req_head head;
> +  le32  domain;
> +  le64  virt_start;
> +  le64  virt_end;
> +  le64  phys_start;
> +  le32  flags;
> +  struct virtio_iommu_req_tail tail;
> +};
> +
> +/* Flags are: */
> +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
> +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
> +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
> +#define VIRTIO_IOMMU_MAP_F_MMIO   (1 << 3)
> +\end{lstlisting}
> +
> +Map a range of virtually-contiguous addresses to a range of
> +physically-contiguous addresses of the same size. After the request
> +succeeds, all endpoints attached to this domain can access memory in the
> +range $[virt\_start; virt\_end]$ (inclusive). For example, if an endpoint
> +accesses address $VA \in [virt\_start; virt\_end]$, the device (or the
> +physical IOMMU) translates the address: $PA = VA - virt\_start +
> +phys\_start$. If the access parameters are compatible with \field{flags}
> +(for instance, the access is write and \field{flags} are
> +VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
> +the access to reach $PA$.
> +
> +The range defined by \field{virt_start} and \field{virt_end} should be
> +within the limits specified by \field{input_range}. Given $phys\_end =
> +phys\_start + virt\_end - virt\_start$, the range defined by
> +\field{phys_start} and phys_end should be within the guest-physical
> +address space. This includes upper and lower limits, as well as any
> +carving of guest-physical addresses for use by the host. Guest physical
> +boundaries are set by the host using a firmware mechanism outside the
> +scope of this specification.
> +
> +Availability and allowed combinations of \field{flags} depend on the
> +underlying IOMMU architectures. VIRTIO_IOMMU_MAP_F_READ and
> +VIRTIO_IOMMU_MAP_F_WRITE are usually implemented, although READ is
> +sometimes implied by WRITE. VIRTIO_IOMMU_MAP_F_EXEC might not be
> +available. In addition combinations such as "WRITE and not READ" or "WRITE
> +and EXEC" might not be supported.
> +
> +The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a protection
> +flag. It may be used, for example, to map Message Signaled Interrupt
> +doorbells when a VIRTIO_IOMMU_RESV_MEM_T_MSI region isn't available. To
> +trigger interrupts the endpoint performs a direct memory write to another
> +peripheral, the IRQ chip. Since it is a signal, the write must not be
> +buffered, elided, or combined with other writes by the memory
> +interconnect. The precise meaning of the MMIO flag depends on the
> +underlying memory architecture (for example on Armv8-A it corresponds to
> +the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
> +isn't required to support the MMIO flag.
> +
> +This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
> +negotiated.
> +
> +\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
> +
> +The driver SHOULD set undefined \field{flags} bits to zero.
> +
> +\field{virt_end} MUST be strictly greater than \field{virt_start}.
> +
> +The driver SHOULD set the VIRTIO_IOMMU_MAP_F_MMIO flag when the physical
> +range corresponds to memory-mapped device registers. The physical range
> +SHOULD have a single memory type: either normal memory or memory-mapped
> +I/O.
> +
> +\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
> +
> +If \field{virt_start}, \field{phys_start} or (\field{virt_end} + 1) is
> +not aligned on the page granularity, the device SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.
> +
> +If a mapping already exists in the requested range, the device SHOULD set
> +the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD NOT change
> +any mapping.
> +
> +If the device doesn't recognize a \field{flags} bit, it SHOULD set the
> +request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
> +SHOULD NOT create the mapping. \footnote{Validating the input is important
> +here, because the driver might be attempting to map with special flags
> +that the device doesn't recognize. Creating the mapping with incompatible
> +flags may result in loss of coherency and security hazards.}
> +
> +If a flag or combination of flag isn't supported, the device MAY set the
> +request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +
> +The device MUST NOT allow writes to a range mapped without the
> +VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
> +does not support write-only mappings, the device MAY allow reads to a
> +range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
> +VIRTIO_IOMMU_MAP_F_READ.
> +
> +If \field{domain} does not exist, the device SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_unmap {
> +  struct virtio_iommu_req_head head;
> +  le32  domain;
> +  le64  virt_start;
> +  le64  virt_end;
> +  u8    reserved[4];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
> +a mapping as a virtual region created with a single MAP request. All
> +mappings covered by the range $[virt\_start; virt\_end]$ (inclusive) are
> +removed.
> +
> +The semantics of unmapping are specified in \ref{drivernormative:Device
> +Types / IOMMU Device / Device operations / UNMAP request} and
> +\ref{devicenormative:Device Types / IOMMU Device / Device operations /
> +UNMAP request}, and illustrated with the following requests, assuming each
> +example sequence starts with a blank address space. We define two
> +pseudocode functions \texttt{map(virt_start, virt_end) -> mapping} and
> +\texttt{unmap(virt_start, virt_end)}.
> +
> +\begin{lstlisting}
> +(1) unmap(virt_start=0,
> +          virt_end=4)            -> succeeds, doesn't unmap anything
> +
> +(2) a = map(virt_start=0,
> +            virt_end=9);
> +    unmap(0, 9)                  -> succeeds, unmaps a
> +
> +(3) a = map(0, 4);
> +    b = map(5, 9);
> +    unmap(0, 9)                  -> succeeds, unmaps a and b
> +
> +(4) a = map(0, 9);
> +    unmap(0, 4)                  -> faults, doesn't unmap anything
> +
> +(5) a = map(0, 4);
> +    b = map(5, 9);
> +    unmap(0, 4)                  -> succeeds, unmaps a
> +
> +(6) a = map(0, 4);
> +    unmap(0, 9)                  -> succeeds, unmaps a
> +
> +(7) a = map(0, 4);
> +    b = map(10, 14);
> +    unmap(0, 14)                 -> succeeds, unmaps a and b
> +\end{lstlisting}
> +
> +This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
> +negotiated.
> +
> +\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
> +
> +The driver SHOULD set the \field{reserved} field to zero.
> +
> +The range, defined by \field{virt_start} and \field{virt_end}, SHOULD
> +cover one or more contiguous mappings created with MAP requests. The range
> +MAY spill over unmapped virtual addresses.
> +
> +The first address of a range SHOULD either be the first address of a
> +mapping or be outside any mapping. The last address of a range SHOULD
> +either be the last address of a mapping or be outside any mapping.
> +
> +\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
> +
> +If the \field{reserved} field of an UNMAP request is not zero, the device
> +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
> +the device MAY perform the UNMAP operation.
> +
> +If \field{domain} does not exist, the device SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If a mapping affected by the range is not covered in its entirety by the
> +range (the UNMAP request would split the mapping), then the device SHOULD
> +set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
> +remove any mapping.
> +
> +If part of the range or the full range is not covered by an existing
> +mapping, then the device SHOULD remove all mappings affected by the range
> +and set the request \field{status} to VIRTIO_IOMMU_S_OK.
> +
> +\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}
> +
> +If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
> +VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
> +device manages. This probe is performed before attaching the endpoint to
> +a domain.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_probe {
> +  struct virtio_iommu_req_head head;
> +  /* Device-readable */
> +  le32  endpoint;
> +  u8    reserved[64];
> +
> +  /* Device-writable */
> +  u8    properties[probe_size];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +\begin{description}
> +\item[\field{endpoint}] has the same meaning as in ATTACH and DETACH
> +  requests.
> +
> +\item[\field{reserved}] is used as padding, so that future extensions can
> +  add fields to the device-readable part.
> +
> +\item[\field{properties}] contains a list of properties of the
> +  \field{endpoint}, filled by the device. The length of the
> +  \field{properties} field is \field{probe_size} bytes. Each property is
> +  described with a \verb+struct virtio_iommu_probe_property+ header, which
> +  may be followed by a value of size \field{length}.
> +
> +\begin{lstlisting}
> +#define VIRTIO_IOMMU_PROBE_T_MASK 0xfff
> +
> +struct virtio_iommu_probe_property {
> +  le16  type;
> +  le16  length;
> +};
> +\end{lstlisting}
> +
> +\end{description}
> +
> +The driver allocates a buffer of adequate size for the probe request,
> +writes \field{endpoint} and adds the buffer to the request queue. The
> +device fills the \field{properties} field with a list of properties for
> +this endpoint.
> +
> +The driver parses the first property by reading \field{type}, then
> +\field{length}. If the driver recognizes \field{type}, it reads and
> +handles the rest of the property. The driver then reads the next property,
> +that is located $(\field{length} + 4)$ bytes after the beginning of the
> +first one, and so on. The driver parses all properties until it reaches a
> +NONE property or the end of \field{properties}.
> +
> +The upper nibble of \field{type} is reserved for future extensions.
> +Therefore only 4096 types are available. The actual type of a property is
> +extracted like this:
> +
> +\begin{lstlisting}
> +u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_T_MASK;
> +\end{lstlisting}
> +
> +Available property types are described in section
> +\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
> +
> +\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
> +
> +The size of \field{properties} MUST be \field{probe_size} bytes.
> +
> +The driver SHOULD set \field{reserved} to zero.
> +
> +If the driver doesn't recognize the \field{type} of a property, it SHOULD
> +ignore the property and continue parsing the list.
> +
> +The driver SHOULD NOT deduce the property length from \field{type}.
> +
> +The driver SHOULD ignore bits[15:12] of \field{type}.
> +
> +\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
> +
> +If the \field{reserved} field of a PROBE request is not zero, the device
> +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.
> +
> +If the endpoint identified by \field{endpoint} doesn't exist, then the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
> +driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
> +the buffers on the used ring and set the \field{len} field of the used
> +element to zero.
> +
> +The device SHOULD set bits [15:12] of property \field{type} to zero.
> +
> +The device MUST write the size of the property without the
> +\verb+struct virtio_iommu_probe_property+ header, in bytes, into
> +\field{length}.
> +
> +When two properties follow each others, the device MUST put the second
> +property exactly $(\field{length} + 4)$ bytes after the beginning of the
> +first one.
> +
> +If the \field{properties} list is smaller than \field{probe_size}, then
> +the device SHOULD NOT write any property and SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_INVAL.
> +
> +If the device doesn't fill all \field{probe_size} bytes with properties,
> +it SHOULD fill the remaining bytes of \field{properties} with zeroes.
> +
> +\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}
> +
> +\begin{lstlisting}
> +#define VIRTIO_IOMMU_PROBE_T_NONE       0
> +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
> +\end{lstlisting}
> +
> +\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}
> +
> +Marks the end of the property list. This property doesn't have any value,
> +and should have \field{length} 0.
> +
> +\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +
> +The RESV_MEM property describes a chunk of reserved virtual memory. It may
> +be used by the device to describe virtual address ranges that shouldn't be
> +allocated by the driver, or that are special.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_probe_resv_mem {
> +  struct virtio_iommu_probe_property head;
> +  u8    subtype;
> +  u8    reserved[3];
> +  le64  start;
> +  le64  end;
> +};
> +\end{lstlisting}
> +
> +Fields \field{start} and \field{end} describe the range of reserved virtual
> +addresses. \field{subtype} may be one of:
> +
> +\begin{description}
> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> +    Accesses to virtual addresses in this region have undefined behavior.
> +    They may be aborted by the device, bypass it, or never even reach it.
> +    The region may also be used for host mappings, for example Message
> +    Signaled Interrupts.
> +
> +    The guest should neither use these virtual addresses in a MAP request
> +    nor instruct endpoints to perform DMA on them.
> +
> +  \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
> +    This region is a doorbell for Message Signaled Interrupts (MSIs). It
> +    is similar to VIRTIO_IOMMU_RESV_MEM_T_RESERVED, in that the driver
> +    should not map virtual addresses described by the property.
> +
> +    In addition it tells the guest how to handle MSI doorbells. If the
> +    endpoint doesn't have a VIRTIO_IOMMU_RESV_MEM_T_MSI property
> +    corresponding to the doorbell of a virtual MSI controller, then the
> +    guest should create a mapping for it.
> +\end{description}
> +
> +\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +
> +The driver SHOULD NOT map any virtual address described by a
> +VIRTIO_IOMMU_RESV_MEM_T_RESERVED or VIRTIO_IOMMU_RESV_MEM_T_MSI property.
> +
> +The driver SHOULD ignore \field{reserved}.
> +
> +The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
> +was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.
> +
> +\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +
> +The device SHOULD set \field{reserved} to zero.
> +
> +The device SHOULD NOT present more than one VIRTIO_IOMMU_RESV_MEM_T_MSI
> +property per endpoint.
> +
> +The device SHOULD NOT present RESV_MEM properties that overlap each others
> +for the same endpoint.
> +
> +\subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting}
> +
> +The device can report translation faults and other significant asynchronous
> +events on the event virtqueue. The driver initially populates the queue with
> +empty report buffers. When the device needs to report an event, it fills a
> +buffer and notifies the driver with an interrupt. The driver consumes the
> +report and moves the buffer back onto the queue.
> +
> +If no buffer is available, the device may either wait for one to be consumed,
> +or drop the event.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_fault {
> +  u8    reason;
> +  u8    reserved[3];
> +  le32  flags;
> +  le32  endpoint;
> +  le32  reserved1;
> +  le64  address;
> +};
> +
> +#define VIRTIO_IOMMU_FAULT_F_READ     (1 << 0)
> +#define VIRTIO_IOMMU_FAULT_F_WRITE    (1 << 1)
> +#define VIRTIO_IOMMU_FAULT_F_EXEC     (1 << 2)
> +#define VIRTIO_IOMMU_FAULT_F_ADDRESS  (1 << 8)
> +\end{lstlisting}
> +
> +\begin{description}
> +  \item[\field{reason}] The reason for this report. It may have the
> +    following values:
> +    \begin{description}
> +      \item[VIRTIO_IOMMU_FAULT_R_UNKNOWN (0)] An internal error happened, or
> +        an error that cannot be described with the following reasons.
> +      \item[VIRTIO_IOMMU_FAULT_R_DOMAIN (1)] The endpoint attempted to
> +        access \field{address} without being attached to a domain.
> +      \item[VIRTIO_IOMMU_FAULT_R_MAPPING (2)] The endpoint attempted to
> +        access \field{address}, which wasn't mapped in the domain or
> +        didn't have the correct protection flags.
> +    \end{description}
> +  \item[\field{flags}] Information about the fault context.
> +  \item[\field{endpoint}] The endpoint causing the fault.
> +  \item[\field{reserved} and \field{reserved1}] Should be zero.
> +  \item[\field{address}] If VIRTIO_IOMMU_FAULT_F_ADDRESS is set, the
> +    address causing the fault.
> +\end{description}
> +
> +These faults are not recoverable\footnote{This means that the PRI
> +extension to PCI, for example, that allows recoverable faults, isn't
> +supported for the moment.}. The guest has to do its best to
> +prevent any future fault from happening, by stopping or resetting the
> +endpoint.
> +
> +When the fault is reported by a physical IOMMU, the fault reasons may not
> +match exactly the reason of the original fault report. The device should
> +try its best to find the closest match.
> +
> +If the device encounters a fault that wasn't caused by a specific
> +endpoint, it is unlikely that the driver would be able to do anything else
> +than print the fault and stop using the device, so reporting the fault on
> +the event queue isn't useful. In that case, we recommend using the
> +DEVICE_NEEDS_RESET status bit.
> +
> +\drivernormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
> +
> +If the \field{reserved} field is not zero, the driver SHOULD ignore the
> +fault report.\footnote{A future format may implement events that are not
> +faults, which would be differentiated by a type field in place of
> +\field{reserved}.}
> +
> +The driver SHOULD ignore undefined \field{flags}.
> +
> +If the driver doesn't recognize \field{reason}, it SHOULD treat the fault
> +as if it was VIRTIO_IOMMU_FAULT_R_UNKNOWN.
> +
> +\devicenormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
> +
> +The device SHOULD set \field{reserved} and \field{reserved1} to zero.
> +
> +The device SHOULD set undefined \field{flags} to zero.
> +
> +The device SHOULD write a valid endpoint ID in \field{endpoint}.
> +
> +The device MAY omit setting VIRTIO_IOMMU_FAULT_F_ADDRESS and writing
> +\field{address} in any fault report, regardless of the \field{reason}.
> +
> +If a buffer is too small to contain the fault report\footnotemark, the
> +device SHOULD NOT use multiple buffers to describe it. The device MAY fall
> +back to using an older fault report format that fits in the buffer.
> +
> +\footnotetext{This would happen for example if the device implements a
> +more recent version of this specification, whose fault report contains
> +additional fields.}
> 



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [PATCH v3] Add virtio-iommu device specification
  2019-04-30 13:56 [virtio-comment] [PATCH v3] Add virtio-iommu device specification Jean-Philippe Brucker
  2019-04-30 14:42 ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
@ 2019-05-03 17:05 ` Michael S. Tsirkin
  2019-05-10 12:07   ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
  1 sibling, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2019-05-03 17:05 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	kevin.tian, lorenzo.pieralisi, bharat.bhushan, bauerman

On Tue, Apr 30, 2019 at 02:56:48PM +0100, Jean-Philippe Brucker wrote:
> The IOMMU device allows a guest to manage DMA mappings for physical,
> emulated and paravirtualized endpoints. Add device description for the
> virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
> UNMAP requests, as well as translation error reporting.
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
> Since v2 I rebased onto virtio v1.1-wd02, fixing a conflict in
> conformance.tex and using the new \conformance command.
> 
> A PDF version is available at
> https://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.11-draft.pdf
> ---
>  conformance.tex  |  40 ++-
>  content.tex      |   1 +
>  virtio-iommu.tex | 850 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 889 insertions(+), 2 deletions(-)
>  create mode 100644 virtio-iommu.tex
> 
> diff --git a/conformance.tex b/conformance.tex
> index 42f702a..79a3e7d 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>    \begin{itemize}
>      \item Clause \ref{sec:Conformance / Driver Conformance}.
>      \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver Conformance}.
> -    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance}.
> +    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}.
>      \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
>    \end{itemize}
>  \item[Device] A device MUST conform to four conformance clauses:
>    \begin{itemize}
>      \item Clause \ref{sec:Conformance / Device Conformance}.
>      \item One of clauses \ref{sec:Conformance / Device Conformance / PCI Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device Conformance}.
> -    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device Conformance}.
> +    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance}, \ref{sec:Conformance / Device Conformance / Socket Device Conformance} or \ref{sec:Conformance / Device Conformance / IOMMU Device Conformance}.
>      \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
>    \end{itemize}
>  \end{description}
> @@ -183,6 +183,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \item \ref{drivernormative:Device Types / Socket Device / Device Operation / Device Events}
>  \end{itemize}
>  
> +\conformance{\subsection}{IOMMU Driver Conformance}\label{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}
> +
> +An IOMMU driver MUST conform to the following normative statements:
> +
> +\begin{itemize}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Feature bits}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device configuration layout}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device Initialization}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / ATTACH request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / DETACH request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / MAP request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / UNMAP request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE request}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / Fault reporting}
> +\end{itemize}
> +
>  \conformance{\section}{Device Conformance}\label{sec:Conformance / Device Conformance}
>  
>  A device MUST conform to the following normative statements:
> @@ -336,6 +354,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \item \ref{devicenormative:Device Types / Socket Device / Device Operation / Receive and Transmit}
>  \end{itemize}
>  
> +\conformance{\subsection}{IOMMU Device Conformance}\label{sec:Conformance / Device Conformance / IOMMU Device Conformance}
> +
> +An IOMMU device MUST conform to the following normative statements:
> +
> +\begin{itemize}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / MAP request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / UNMAP request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE request}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / Fault reporting}
> +\end{itemize}
> +
>  \conformance{\section}{Legacy Interface: Transitional Device and Transitional Driver Conformance}\label{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}
>  A conformant implementation MUST be either transitional or
>  non-transitional, see \ref{intro:Legacy
> diff --git a/content.tex b/content.tex
> index 193b6e1..5449a46 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5594,6 +5594,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
>  \input{virtio-input.tex}
>  \input{virtio-crypto.tex}
>  \input{virtio-vsock.tex}
> +\input{virtio-iommu.tex}
>  
>  \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>  
> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> new file mode 100644
> index 0000000..be3dcd3
> --- /dev/null
> +++ b/virtio-iommu.tex
> @@ -0,0 +1,850 @@
> +\section{IOMMU device}\label{sec:Device Types / IOMMU Device}
> +
> +The virtio-iommu device manages Direct Memory Access (DMA) from one or
> +more endpoints. It may act both as a proxy for physical IOMMUs managing
> +devices assigned to the guest, and as virtual IOMMU managing emulated and
> +paravirtualized devices.
> +
> +The driver first discovers endpoints managed by the virtio-iommu device
> +using standard firmware mechanisms.

Let's say "platform specific mechanisms" here?

> It then sends requests to create
> +virtual address spaces and virtual-to-physical mappings for these
> +endpoints. In its simplest form, the virtio-iommu supports four request
> +types:
> +
> +\begin{enumerate}
> +\item Create a domain and attach an endpoint to it.  \\
> +  \texttt{attach(endpoint = 0x8, domain = 1)}
> +\item Create a mapping between a range of guest-virtual and guest-physical
> +  address. \\
> +  \texttt{map(domain = 1, virt_start = 0x1000, virt_end = 0x1fff,
> +          phys = 0xa000, flags = READ)}
> +
> +  Endpoint 0x8, for example a hardware PCI endpoint with BDF 00:01.0, can
> +  now read at addresses 0x1000-0x1fff. These accesses are translated
> +  into system-physical addresses by the IOMMU.
> +
> +\item Remove the mapping.\\
> +  \texttt{unmap(domain = 1, virt_start = 0x1000, virt_end = 0x1fff)}
> +
> +  Any access to addresses 0x1000-0x1fff by endpoint 0x8 would now be
> +  rejected.
> +\item Detach the device and remove the domain.\\
> +  \texttt{detach(endpoint = 0x8, domain = 1)}
> +\end{enumerate}
> +
> +\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}
> +
> +23
> +
> +\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}
> +
> +\begin{description}
> +\item[0] requestq
> +\item[1] eventq
> +\end{description}
> +
> +\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}
> +
> +\begin{description}
> +\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
> +  Available range of virtual addresses is described in \field{input_range}
> +
> +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
> +  The number of domains supported is described in \field{domain_bits}

I would rename this to domain_range or something like this.
E.g. there's an option of 0 which is special.

> +
> +\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
> +  Map and unmap requests are available.\footnote{Future extensions may add
> +  different modes of operations. At the moment, only
> +  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
> +
> +\item[VIRTIO_IOMMU_F_BYPASS (3)]
> +  When not attached to a domain, endpoints downstream of the IOMMU
> +  can access the guest-physical address space.
> +
> +\item[VIRTIO_IOMMU_F_PROBE (4)]
> +  The PROBE request is available.
> +\end{description}
> +
> +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> +
> +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
> +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.

Why does it have to accept MAP_UNMAP if it does not want to use it?

> +
> +\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> +
> +If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_PROBE or
> +VIRTIO_IOMMU_F_MAP_UNMAP feature bits, and if the driver did not accept
> +this feature bit, then the device MAY signal failure by failing to set
> +FEATURES_OK \field{device status} bit when the driver writes it.

I'm not sure what value does this statement add.
I would say devices should offer MAP_UNMAP

> +
> +\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}
> +
> +The \field{page_size_mask} field is always present. Availability of the
> +others depend on various feature bits as indicated above.

replace above with link to a chapter please.

> +
> +\begin{lstlisting}
> +struct virtio_iommu_config {
> +  u64 page_size_mask;
> +  struct virtio_iommu_range {
> +    u64 start;
> +    u64 end;
> +  } input_range;
> +  u8  domain_bits;
> +  u8  padding[3];
> +  u32 probe_size;
> +};
> +\end{lstlisting}

These should be LE, right?


> +
> +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
> +
> +The driver MUST NOT write to device configuration fields.
> +
> +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
> +
> +The device SHOULD set \field{padding} to zero.
> +
> +The device MUST set at least one bit in \field{page_size_mask}, describing
> +the page granularity. The device MAY set more than one bit in
> +\field{page_size_mask}.
> +
> +\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}
> +
> +When the device is reset, endpoints are not attached to any domain.

I'd add a paragraph break here.

> +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
> +access guest-physical addresses ("bypass mode").

I think this means actually "all unattached endpoints"?


So how does this interact with e.g. encrypted memory?
I think it's actually weaker. This bit IMHO just means that
all accesses are allowed by the iommu and all addresses
are translated 1:1.

Also what about other endpoints? With bypass
does an endpoint not in any domain get access to all other
endpoints?

> If the feature is not
> +negotiated, then any memory access from endpoints will fault.

fault -> fail?

So all accesses are disallowed, right?

> Upon
> +attaching an endpoint in bypass mode to a new domain, any memory access
> +from the endpoint will fault, since the domain does not contain any
> +mapping.
> +
> +The driver chooses operating mode depending on its capabilities. In this
> +version of the virtio-iommu device, the only supported mode is
> +VIRTIO_IOMMU_F_MAP_UNMAP.

I would drop this paragraph. Let's instead explain what
is expected. E.g.

Future devices might support more modes of operation
besides MAP/UNMAP. To fail gracefully on such devices,
drivers should verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP
and fail gracefully if they don't.

> +
> +\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
> +
> +The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
> +of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.
> +
> +If the VIRTIO_IOMMU_F_PROBE feature is negotiated, the driver SHOULD send a
> +VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
> +endpoint to a domain.
> +
> +\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
> +
> +If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
> +device SHOULD NOT let endpoints access the guest-physical address space.
> +
> +\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}
> +
> +Driver send requests on the request virtqueue, notifies the device and
> +waits for the device to return the request with a status in the used ring.
> +All requests are split in two parts: one device-readable, one device-
> +writable.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_head {
> +  u8   type;
> +  u8   reserved[3];
> +};
> +
> +struct virtio_iommu_req_tail {
> +  u8   status;
> +  u8   reserved[3];
> +};
> +\end{lstlisting}
> +
> +Type may be one of:
> +
> +\begin{lstlisting}
> +#define VIRTIO_IOMMU_T_ATTACH     1
> +#define VIRTIO_IOMMU_T_DETACH     2
> +#define VIRTIO_IOMMU_T_MAP        3
> +#define VIRTIO_IOMMU_T_UNMAP      4
> +#define VIRTIO_IOMMU_T_PROBE      5
> +\end{lstlisting}
> +
> +A few general-purpose status codes are defined here. Unless explicitly
> +described in a \textbf{Requirements} section,

Replace by a link pls.

> these values are hints to
> +make troubleshooting easier.
> +
> +When the device fails to parse a request, for instance if a request seems

seems -> is?

> +too small for its type and the device cannot find the tail, then it will
> +be unable to set \field{status}. In that case, it should return the
> +buffers without writing in them.
> +
> +\begin{lstlisting}
> +/* All good! Carry on. */
> +#define VIRTIO_IOMMU_S_OK         0
> +/* Virtio communication error */
> +#define VIRTIO_IOMMU_S_IOERR      1
> +/* Unsupported request */
> +#define VIRTIO_IOMMU_S_UNSUPP     2
> +/* Internal device error */
> +#define VIRTIO_IOMMU_S_DEVERR     3
> +/* Invalid parameters */
> +#define VIRTIO_IOMMU_S_INVAL      4
> +/* Out-of-range parameters */
> +#define VIRTIO_IOMMU_S_RANGE      5
> +/* Entry not found */
> +#define VIRTIO_IOMMU_S_NOENT      6
> +/* Bad address */
> +#define VIRTIO_IOMMU_S_FAULT      7
> +\end{lstlisting}
> +
> +Range limits of some request fields are described in the device
> +configuration:
> +
> +\begin{itemize}
> +\item \field{page_size_mask} contains the bitmask of all page sizes that
> +  can be mapped. The least significant bit set defines the page
> +  granularity of IOMMU mappings.

so this refers to map/unmap requests only really?



> Other bits in the mask are hints
> +  describing page sizes that the IOMMU can merge into a single mapping
> +  (page blocks).

what does this merging refer to? I don't find it anywhere in the spec.

> +
> +  The smallest page granularity supported by the IOMMU is one byte. It is
> +  legal for the driver to map one byte at a time if bit 0 of
> +  \field{page_size_mask} is set.

With single byte mappings, can't guest quickly exhaust
device memory? what guidance can we provide for handling
this kind of issue? how do device and driver avoid it?

> +
> +\item If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is offered,
> +  \field{domain_bits} contains the number of bits supported in a domain
> +  ID, the identifier used in most requests.

what does "most" refer to here? I don't see domain ID. I see
\field{domain} I guess this is what is meant.

> A value of 0 is valid, it
> +  means that a single domain is supported and endpoints can only be
> +  attached to domain 0.

What about values > 32?

> +
> +  If the feature is not offered, domain identifiers can use up to 32 bits.

I'd just say "any domain value is valid".

> +
> +\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
> +  \field{input_range} contains the virtual address range that the IOMMU is
> +  able to translate. Any mapping request to virtual addresses outside of
> +  this range will fail.
> +
> +  If the feature is not offered, virtual mappings span over the whole
> +  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
> +\end{itemize}
> +
> +\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
> +
> +The driver SHOULD set field \field{reserved} of
> +\verb+struct virtio_iommu_req_head+ to zero.

We generally don't use \\verb with these.
I'm not against using \\verb for structs but then I think
it should be everywhere.

> +
> +When a device returns a complete request in the used queue


Do you mean "when device uses a buffer"

> without having
> +written to it 

I'd add "i.e. used length is 0"?

>, the driver SHOULD interpret it as a failure from the device
> +to parse the request.

Parse specifically or can this be any error?

> +
> +If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, the driver SHOULD
> +NOT send requests with \field{virt_start} less than
> +\field{input_range.start} or \field{virt_end} greater than
> +\field{input_range.end}.

Not MUST_NOT? Is device expected to check and recover from a violation?

> +
> +If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is negotiated, the driver SHOULD
> +NOT

Not MUST_NOT? Is device expected to check and recover from a violation?

> send requests with \field{domain} greater than the size described by
> +\field{domain_bits}.

I'd say "

> +
> +The driver SHOULD NOT use multiple descriptor chains for a single request.

Oh. Generlly we try to avoid limits on buffer layout. Why do you
want to add this limitation?
Also, chains is not the only way to have s/g.
Is indirect also illegal?

I would drop this.
Or if it's really needed I think we should work on a generic
field that limits the # of descriptors.


> +
> +\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
> +
> +The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
> +didn't succeed.

Flip it and say what it must do?

> +
> +If a request \field{type} is not recognized, the device SHOULD return the
> +buffers on the used ring and set the \field{len} field of the used element
> +to zero.

Just say device must not write into buffer, and set used length to 0.

> +
> +The device SHOULD ignore field \field{reserved} of
> +\verb+struct virtio_iommu_req_head+ and SHOULD set field \field{reserved}
> +of \verb+struct virtio_iommu_req_tail+ to zero.

should because you want to use it in the future?
if we don't make at least ignoring a MUST then it won't work.

> +
> +If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated and the range
> +described by fields \field{virt_start} and \field{virt_end} doesn't fit in
> +the range described by \field{input_range}, the device MAY set
> +\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.

what are other options?
Also ignore - meaning what? And what is the used length btw?
size of struct virtio_iommu_req_tail ?

> +
> +If the VIRTIO_IOMMU_F_DOMAIN_BITS is negotiated and bits above
> +\field{domain_bits} are set in field \field{domain},

I'd just say "domain is invalid".

> the device MAY set
> +\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.


> +
> +\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_attach {
> +  struct virtio_iommu_req_head head;
> +  le32 domain;
> +  le32 endpoint;
> +  u8   reserved[8];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +Attach an endpoint to a domain. \field{domain} is an identifier unique to
> +the virtio-iommu device.

what does this mean? do you mean that it uniquely identifies
a domain? And I guess the assumption is that endpoints in different domains
are isolated from each other?

> The \field{domain} number doesn't have a meaning
> +outside of virtio-iommu. If the domain doesn't exist in the device, it is
> +created.
> \field{endpoint} is an identifier unique to the virtio-iommu
> +device.

what does this mean? did you mean it uniquely identifies an
endpoint? I'd just drop this.

> The host communicates these unique endpoint IDs to the guest
> using
> +methods outside the scope of this specification,


I'd say "semantics of the endpoint ID are platform specific".
Rest of above text seems to confuse more than it clarifies.


> but the following rules
> +apply:
> +
> +\begin{itemize}
> +\item The endpoint ID is unique from the virtio-iommu point of view.

Let's add "The endpoint ID identifies an endpoint"?

> +  Multiple endpoints whose DMA transactions are not translated by the same
> +  virtio-iommu may have the same endpoint ID. Endpoints whose DMA
> +  transactions may be translated by the same virtio-iommu must have
> +  different endpoint IDs.
> +
> +\item Sometimes the host cannot completely isolate two endpoints from each
> +  others.

Might not be a host thing either. How about
"On some platforms, it might not be possible to
isolate"...

> For example on a legacy PCI bus,

official name is conventional PCI

> endpoints

> can snoop DMA
> +  transactions from their neighbours

Other endpoints on the same bus?

>. In this case, the host must
> +  communicate to the guest that it cannot isolate these endpoints from
> +  each others

each other

>, or that the physical IOMMU cannot distinguish transactions
> +  coming from these endpoints. The method used to communicate this is
> +  outside the scope of this specification.


Let's rewrite in a way that does not involve host guest or physical
IOMMU. E.g.


	Such limitations need to be communicated in a platform specific
	way.


> +\end{itemize}
> +
> +Multiple endpoints may be attached to the same domain. An endpoint cannot
> +be attached to multiple domains at the same time.

Let's rephrase positively:
	a single endpoint to a single domain?

> +
> +\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> +
> +The driver SHOULD set \field{reserved} to zero.
> +
> +The driver SHOULD ensure that endpoints that cannot be isolated by the
> +host are attached to the same domain.

Conformance clauses can't use terms like "host". Just driver,
device and platform.


> +
> +\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> +
> +If the \field{reserved} field of an ATTACH request is not zero, the device
> +SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
> +NOT attach the endpoint to the domain. \footnote{The device should

what does should mean here? Is expected to? Or is this part of
the conformance clause? the SHOULD and take it out of the footnote.

> +validate input of ATTACH requests in case the driver attempts to attach in
> +a mode that is unimplemented by the device, and would be incompatible with
> +the modes implemented by the device.}

Where does mode come from?

> +
> +If the endpoint identified by \field{endpoint} doesn't exist, then the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

I guess an alternative is some other error status?

> +
> +If another endpoint is already attached to the domain identified by
> +\field{domain}, then the device MAY attach the endpoint identified by
> +\field{endpoint} to the domain. If it cannot do so, the device
> +MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +
> +If the endpoint identified by \field{endpoint} is already attached to
> +another domain, then the device SHOULD first detach it from that domain
> +and attach it to the one identified by \field{domain}. In that case the
> +device behaves

SHOULD behave?

> as if the driver issued a DETACH request with this
> +\field{endpoint}, followed by the ATTACH request. If the device cannot do
> +so, it MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +
> +If properties of the endpoint (obtained with a PROBE request) are
> +incompatible with properties of other endpoints

Let's start with a good case: if properties a compatible then ...
otherwise ...

> already attached to the
> +requested domain, the device MAY attach the endpoint.
> If it cannot do so,
> the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.

Confusing IMHO. What is the preferred mode of operation then?

How about:

device SHOULD reject the request and set status ....

A device that does not reject the request MUST
attach the endpoint.



> +\footnote{In general it is simpler and safer
> to reject attach when two devices

two endpoints?

> +have differing values in a property, for example two reserved regions of
> +different types that would overlap.

this example doesn't really clarify much. reserved regions
are not mentioned anywhere else. drop or provide a
better example?

> Depending on the property, device
> +implementation can try to merge them

merge what?

> and accept the attach.}
> +
> +\subsubsection{DETACH request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_detach {
> +  struct virtio_iommu_req_head head;
> +  le32 domain;
> +  le32 endpoint;
> +  u8   reserved[8];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +Detach an endpoint from a domain. When this request completes, the
> +endpoint cannot access any mapping from that domain anymore. If feature
> +VIRTIO_IOMMU_F_BYPASS has been negotiated, then the endpoint accesses the
> +guest-physical address space once this request completes.

again something like 1:1 mapping would be appropriate here.

> +
> +After all endpoints have been successfully detached from a domain, it
> +ceases to exist and its ID can be reused by the driver for another domain.

what happens with the merging as described above?

> +
> +\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
> +
> +The driver SHOULD set \field{reserved} to zero.
> +
> +\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
> +
> +If the \field{reserved} field of a DETACH request is not zero, the device
> +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
> +the device MAY still perform the DETACH operation.

why so complex? let's just ask devices to ignore this field?

> +
> +If the endpoint identified by \field{endpoint} doesn't exist, then the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If the domain identified by \field{domain} doesn't exist, or if the
> +endpoint identified by \field{endpoint} isn't attached to this domain,
> +then the device MAY set the request \field{status} to
> +VIRTIO_IOMMU_S_INVAL.
> +
> +The device MUST ensure that after being detached from a domain, the
> +endpoint cannot access any mapping from that domain.
> +
> +\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_map {
> +  struct virtio_iommu_req_head head;
> +  le32  domain;
> +  le64  virt_start;
> +  le64  virt_end;
> +  le64  phys_start;
> +  le32  flags;
> +  struct virtio_iommu_req_tail tail;
> +};
> +
> +/* Flags are: */
> +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
> +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
> +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)

what is exec? pls add some comments near all flags.

> +#define VIRTIO_IOMMU_MAP_F_MMIO   (1 << 3)
> +\end{lstlisting}
> +
> +Map a range of virtually-contiguous addresses to a range of
> +physically-contiguous addresses of the same size. After the request
> +succeeds, all endpoints attached to this domain can access memory in the
> +range $[virt\_start; virt\_end]$ (inclusive). For example, if an endpoint
> +accesses address $VA \in [virt\_start; virt\_end]$, the device (or the
> +physical IOMMU) translates the address: $PA = VA - virt\_start +
> +phys\_start$. If the access parameters are compatible with \field{flags}
> +(for instance, the access is write and \field{flags} are
> +VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
> +the access to reach $PA$.
> +
> +The range defined by \field{virt_start} and \field{virt_end} should be
> +within the limits specified by \field{input_range}. Given $phys\_end =
> +phys\_start + virt\_end - virt\_start$, the range defined by
> +\field{phys_start} and phys_end should be within the guest-physical
> +address space. This includes upper and lower limits, as well as any
> +carving of guest-physical addresses for use by the host. Guest physical
> +boundaries are set by the host using a firmware mechanism outside the
> +scope of this specification.
> +
> +Availability and allowed combinations of \field{flags} depend on the
> +underlying IOMMU architectures.

on the platform? and how are these discovered? also platform
specific? why not feature bits?

> VIRTIO_IOMMU_MAP_F_READ and
> +VIRTIO_IOMMU_MAP_F_WRITE are usually implemented, although READ is
> +sometimes implied by WRITE. VIRTIO_IOMMU_MAP_F_EXEC might not be
> +available. In addition combinations such as "WRITE and not READ" or "WRITE
> +and EXEC" might not be supported.
> +
> +The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a protection
> +flag. It may be used, for example, to map Message Signaled Interrupt
> +doorbells when a VIRTIO_IOMMU_RESV_MEM_T_MSI region isn't available. To
> +trigger interrupts the endpoint performs a direct memory write to another
> +peripheral, the IRQ chip. Since it is a signal, the write must not be
> +buffered, elided, or combined with other writes by the memory
> +interconnect.

by anyone really right?

> The precise meaning of the MMIO flag depends on the
> +underlying memory architecture (for example on Armv8-A it corresponds to
> +the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
> +isn't required to support the MMIO flag.

I really dislike how this part is written. A platform specific bit that
does whatever platform wants? How does one use it based on this?
Platform documentation is hoghly unlikely to explain what does
VIRTIO_IOMMU_MAP_F_MMIO mean.

Does this mean "not buffered, elided, or combined with other writes"?

Also maybe add a feature bit so platforms can declare support.

> +
> +This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
> +negotiated.
> +
> +\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
> +
> +The driver SHOULD set undefined \field{flags} bits to zero.
> +
> +\field{virt_end} MUST be strictly greater than \field{virt_start}.
> +
> +The driver SHOULD set the VIRTIO_IOMMU_MAP_F_MMIO flag when the physical
> +range corresponds to memory-mapped device registers. The physical range
> +SHOULD have a single memory type: either normal memory or memory-mapped
> +I/O.
> +
> +\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
> +
> +If \field{virt_start}, \field{phys_start} or (\field{virt_end} + 1) is
> +not aligned on the page granularity, the device SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.
> +
> +If a mapping already exists in the requested range, the device SHOULD set
> +the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD NOT change
> +any mapping.
> +
> +If the device doesn't recognize a \field{flags} bit, it SHOULD set the
> +request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
> +SHOULD NOT create the mapping. \footnote{Validating the input is important
> +here, because the driver might be attempting to map with special flags
> +that the device doesn't recognize. Creating the mapping with incompatible
> +flags may result in loss of coherency and security hazards.}
> +
> +If a flag or combination of flag


of flags?

> isn't supported, the device MAY set the
> +request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> +
> +The device MUST NOT allow writes to a range mapped without the
> +VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
> +does not support write-only mappings, the device MAY allow reads to a
> +range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
> +VIRTIO_IOMMU_MAP_F_READ.

do drivers care? if yes how about a way for it to know?

> +
> +If \field{domain} does not exist, the device SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_unmap {
> +  struct virtio_iommu_req_head head;
> +  le32  domain;
> +  le64  virt_start;
> +  le64  virt_end;
> +  u8    reserved[4];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
> +a mapping as a virtual region created with a single MAP request. All
> +mappings covered by the range $[virt\_start; virt\_end]$ (inclusive) are
> +removed.
> +
> +The semantics of unmapping are specified in \ref{drivernormative:Device
> +Types / IOMMU Device / Device operations / UNMAP request} and
> +\ref{devicenormative:Device Types / IOMMU Device / Device operations /
> +UNMAP request}, and illustrated with the following requests, assuming each
> +example sequence starts with a blank address space. We define two
> +pseudocode functions \texttt{map(virt_start, virt_end) -> mapping} and
> +\texttt{unmap(virt_start, virt_end)}.
> +
> +\begin{lstlisting}
> +(1) unmap(virt_start=0,
> +          virt_end=4)            -> succeeds, doesn't unmap anything
> +
> +(2) a = map(virt_start=0,
> +            virt_end=9);
> +    unmap(0, 9)                  -> succeeds, unmaps a
> +
> +(3) a = map(0, 4);
> +    b = map(5, 9);
> +    unmap(0, 9)                  -> succeeds, unmaps a and b
> +
> +(4) a = map(0, 9);
> +    unmap(0, 4)                  -> faults, doesn't unmap anything

fails?

> +
> +(5) a = map(0, 4);
> +    b = map(5, 9);
> +    unmap(0, 4)                  -> succeeds, unmaps a
> +
> +(6) a = map(0, 4);
> +    unmap(0, 9)                  -> succeeds, unmaps a
> +
> +(7) a = map(0, 4);
> +    b = map(10, 14);
> +    unmap(0, 14)                 -> succeeds, unmaps a and b
> +\end{lstlisting}
> +

what about a partial unmap? e.g. unmap 0, 3?

> +This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
> +negotiated.
> +
> +\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
> +
> +The driver SHOULD set the \field{reserved} field to zero.
> +
> +The range, defined by \field{virt_start} and \field{virt_end}, SHOULD
> +cover one or more contiguous mappings created with MAP requests. The range
> +MAY spill over unmapped virtual addresses.
> +
> +The first address of a range SHOULD either be the first address of a
> +mapping or be outside any mapping. The last address of a range SHOULD
> +either be the last address of a mapping or be outside any mapping.


above really MUST?

> +
> +\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
> +
> +If the \field{reserved} field of an UNMAP request is not zero, the device
> +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
> +the device MAY perform the UNMAP operation.
> +
> +If \field{domain} does not exist, the device SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If a mapping affected by the range is not covered in its entirety by the
> +range (the UNMAP request would split the mapping), then the device SHOULD
> +set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
> +remove any mapping.
> +
> +If part of the range or the full range is not covered by an existing
> +mapping, then the device SHOULD remove all mappings affected by the range
> +and set the request \field{status} to VIRTIO_IOMMU_S_OK.
> +


So based on above, it seems clear that device must remember any number
of mappings sent to it in device memory. Seems like an easy way
for guest to use any amount of device memory.

If any limits are allowed then I think device needs a way
to communicate them to driver.




> +\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}
> +
> +If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
> +VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
> +device manages. This probe is performed before attaching the endpoint to
> +a domain.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_req_probe {
> +  struct virtio_iommu_req_head head;
> +  /* Device-readable */
> +  le32  endpoint;
> +  u8    reserved[64];
> +
> +  /* Device-writable */
> +  u8    properties[probe_size];
> +  struct virtio_iommu_req_tail tail;
> +};
> +\end{lstlisting}
> +
> +\begin{description}
> +\item[\field{endpoint}] has the same meaning as in ATTACH and DETACH
> +  requests.
> +
> +\item[\field{reserved}] is used as padding, so that future extensions can
> +  add fields to the device-readable part.
> +
> +\item[\field{properties}] contains a list of properties of the
> +  \field{endpoint}, filled by the device. The length of the
> +  \field{properties} field is \field{probe_size} bytes. Each property is
> +  described with a \verb+struct virtio_iommu_probe_property+ header, which
> +  may be followed by a value of size \field{length}.
> +
> +\begin{lstlisting}
> +#define VIRTIO_IOMMU_PROBE_T_MASK 0xfff

I would move it to where it's actually used.

> +
> +struct virtio_iommu_probe_property {
> +  le16  type;
> +  le16  length;
> +};
> +\end{lstlisting}
> +
> +\end{description}
> +
> +The driver allocates a buffer of adequate size

how do we know what is adequate?

> for the probe request,
> +writes \field{endpoint} and adds the buffer to the request queue. The
> +device fills the \field{properties} field with a list of properties for
> +this endpoint.
> +
> +The driver parses the first property by reading \field{type}, then
> +\field{length}. If the driver recognizes \field{type}, it reads and
> +handles the rest of the property. The driver then reads the next property,
> +that is located $(\field{length} + 4)$ bytes after the beginning of the
> +first one, and so on. The driver parses all properties until it reaches a
> +NONE property or the end of \field{properties}.

I guess device must make sure length fits within the buffer?

> +
> +The upper nibble of \field{type} is reserved for future extensions.
> +Therefore only 4096 types are available. The actual type of a property is
> +extracted like this:
> +
> +\begin{lstlisting}
> +u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_T_MASK;
> +\end{lstlisting}

We have a bitfield notation to document format. This might be more appropriate here.

> +
> +Available property types are described in section
> +\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
> +
> +\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
> +
> +The size of \field{properties} MUST be \field{probe_size} bytes.
> +
> +The driver SHOULD set \field{reserved} to zero.
> +
> +If the driver doesn't recognize the \field{type} of a property, it SHOULD
> +ignore the property and continue parsing the list.
> +
> +The driver SHOULD NOT deduce the property length from \field{type}.
> +
> +The driver SHOULD ignore bits[15:12] of \field{type}.

Why not ignore the whole property is high bits are not 0?
Seems safer ...

> +
> +\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
> +
> +If the \field{reserved} field of a PROBE request is not zero, the device
> +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.
> +
> +If the endpoint identified by \field{endpoint} doesn't exist, then the
> +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> +
> +If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
> +driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
> +the buffers on the used ring and set the \field{len} field of the used
> +element to zero.
> +
> +The device SHOULD set bits [15:12] of property \field{type} to zero.
> +
> +The device MUST write the size of the property without the
> +\verb+struct virtio_iommu_probe_property+ header, in bytes, into
> +\field{length}.
> +
> +When two properties follow each others

each other

>, the device MUST put the second
> +property exactly $(\field{length} + 4)$ bytes after the beginning of the
> +first one.
> +
> +If the \field{properties} list is smaller than \field{probe_size}, then
> +the device SHOULD NOT write any property and SHOULD set the request
> +\field{status} to VIRTIO_IOMMU_S_INVAL.
> +
> +If the device doesn't fill all \field{probe_size} bytes with properties,
> +it SHOULD fill the remaining bytes of \field{properties} with zeroes.
> +
> +\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}
> +
> +\begin{lstlisting}
> +#define VIRTIO_IOMMU_PROBE_T_NONE       0
> +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
> +\end{lstlisting}
> +
> +\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}
> +
> +Marks the end of the property list. This property doesn't have any value,
> +and should have \field{length} 0.

why is this needed btw?

> +
> +\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +
> +The RESV_MEM property describes a chunk of reserved virtual memory. It may
> +be used by the device to describe virtual address ranges that shouldn't be
> +allocated by the driver, or that are special.

avoid shouldn't outside conformance clauses. maybe add a conformance
clause that driver must avoid these ranges?
what if driver ignores this property? I guess device must validate
the addresses?

> +
> +\begin{lstlisting}
> +struct virtio_iommu_probe_resv_mem {
> +  struct virtio_iommu_probe_property head;
> +  u8    subtype;
> +  u8    reserved[3];
> +  le64  start;
> +  le64  end;
> +};
> +\end{lstlisting}
> +
> +Fields \field{start} and \field{end} describe the range of reserved virtual
> +addresses. \field{subtype} may be one of:
> +
> +\begin{description}
> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> +    Accesses to virtual addresses in this region have undefined behavior.

undefined?  how can platforms with such a property coexist with untrusted
devices?

> +    They may be aborted by the device,

the device as in the iommu device?

> bypass it,

bypass might be a big security problem.

> or never even reach it.

that's ok

> +    The region may also be used for host mappings, for example Message
> +    Signaled Interrupts.

Let's avoid mentioning host if we can.

> +
> +    The guest should neither use these virtual addresses in a MAP request
> +    nor instruct endpoints to perform DMA on them.

avoid should outside conformance clauses.

and what it guest does not obey this request?

I think we must fix this and require that such accesses
are ignored.

> +
> +  \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
> +    This region is a doorbell for Message Signaled Interrupts (MSIs). It
> +    is similar to VIRTIO_IOMMU_RESV_MEM_T_RESERVED, in that the driver
> +    should not map virtual addresses described by the property.
> +
> +    In addition it tells the guest how to handle MSI doorbells. If the
> +    endpoint doesn't have a VIRTIO_IOMMU_RESV_MEM_T_MSI property
> +    corresponding to the doorbell of a virtual MSI controller, then the
> +    guest should create a mapping for it.

1st paragraph says should not map, second to create a mapping ...
confusing.


> +\end{description}
> +
> +\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +
> +The driver SHOULD NOT map any virtual address described by a
> +VIRTIO_IOMMU_RESV_MEM_T_RESERVED or VIRTIO_IOMMU_RESV_MEM_T_MSI property.
> +
> +The driver SHOULD ignore \field{reserved}.
> +
> +The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
> +was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.

why is that a good idea?

> +
> +\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> +
> +The device SHOULD set \field{reserved} to zero.
> +
> +The device SHOULD NOT present more than one VIRTIO_IOMMU_RESV_MEM_T_MSI
> +property per endpoint.
> +
> +The device SHOULD NOT present

multiple?

> RESV_MEM properties that overlap each others

each other

> +for the same endpoint.
> +
> +\subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting}
> +
> +The device can report translation faults and other significant asynchronous
> +events on the event virtqueue. The driver initially populates the queue with
> +empty report buffers.

write buffers? or read buffers?

> When the device needs to report an event, it fills a
> +buffer and notifies the driver with an interrupt.

interrupts are transport things. pls review notification section
and use consistent wording.

> The driver consumes the
> +report and moves the buffer back onto the queue.

add a buffer to the virtqueue?

> +
> +If no buffer is available, the device may either wait for one to be consumed,
> +or drop the event.
> +
> +\begin{lstlisting}
> +struct virtio_iommu_fault {
> +  u8    reason;
> +  u8    reserved[3];
> +  le32  flags;
> +  le32  endpoint;
> +  le32  reserved1;
> +  le64  address;
> +};
> +
> +#define VIRTIO_IOMMU_FAULT_F_READ     (1 << 0)
> +#define VIRTIO_IOMMU_FAULT_F_WRITE    (1 << 1)
> +#define VIRTIO_IOMMU_FAULT_F_EXEC     (1 << 2)
> +#define VIRTIO_IOMMU_FAULT_F_ADDRESS  (1 << 8)
> +\end{lstlisting}
> +
> +\begin{description}
> +  \item[\field{reason}] The reason for this report. It may have the
> +    following values:
> +    \begin{description}
> +      \item[VIRTIO_IOMMU_FAULT_R_UNKNOWN (0)] An internal error happened, or
> +        an error that cannot be described with the following reasons.
> +      \item[VIRTIO_IOMMU_FAULT_R_DOMAIN (1)] The endpoint attempted to
> +        access \field{address} without being attached to a domain.
> +      \item[VIRTIO_IOMMU_FAULT_R_MAPPING (2)] The endpoint attempted to
> +        access \field{address}, which wasn't mapped in the domain or
> +        didn't have the correct protection flags.
> +    \end{description}
> +  \item[\field{flags}] Information about the fault context.
> +  \item[\field{endpoint}] The endpoint causing the fault.
> +  \item[\field{reserved} and \field{reserved1}] Should be zero.
> +  \item[\field{address}] If VIRTIO_IOMMU_FAULT_F_ADDRESS is set, the
> +    address causing the fault.
> +\end{description}
> +
> +These faults are not recoverable\footnote{This means that the PRI
> +extension to PCI, for example, that allows recoverable faults, isn't
> +supported for the moment.}. The guest has to do its best to
> +prevent any future fault from happening, by stopping or resetting the
> +endpoint.

what does recoverable mean from virtio iommu pov?
nothing right?
it's an endpoint issue.

> +
> +When the fault is reported by a physical IOMMU, the fault reasons may not
> +match exactly the reason of the original fault report. The device should
> +try its best to find the closest match.
> +
> +If the device encounters a fault

device doesn't right? endpoint does ...

> that wasn't caused by a specific
> +endpoint, it is unlikely that the driver would be able to do anything else
> +than print the fault and stop using the device, so reporting the fault on
> +the event queue isn't useful. In that case, we recommend using the
> +DEVICE_NEEDS_RESET status bit.
> +
> +\drivernormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
> +
> +If the \field{reserved} field is not zero, the driver SHOULD ignore the
> +fault report.\footnote{A future format may implement events that are not
> +faults, which would be differentiated by a type field in place of
> +\field{reserved}.}
> +
> +The driver SHOULD ignore undefined \field{flags}.
> +
> +If the driver doesn't recognize \field{reason}, it SHOULD treat the fault
> +as if it was VIRTIO_IOMMU_FAULT_R_UNKNOWN.
> +
> +\devicenormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
> +
> +The device SHOULD set \field{reserved} and \field{reserved1} to zero.
> +
> +The device SHOULD set undefined \field{flags} to zero.
> +
> +The device SHOULD write a valid endpoint ID in \field{endpoint}.
> +
> +The device MAY omit setting VIRTIO_IOMMU_FAULT_F_ADDRESS and writing
> +\field{address} in any fault report, regardless of the \field{reason}.
> +
> +If a buffer is too small to contain the fault report\footnotemark, the
> +device SHOULD NOT use multiple buffers to describe it. The device MAY fall
> +back to using an older fault report format that fits in the buffer.
> +
> +\footnotetext{This would happen for example if the device implements a
> +more recent version of this specification, whose fault report contains
> +additional fields.}
> -- 
> 2.21.0

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification
  2019-04-30 14:42 ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
@ 2019-05-03 19:47   ` Michael S. Tsirkin
  2019-05-05 13:33     ` Jean-Philippe Brucker
  0 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2019-05-03 19:47 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	kevin.tian, lorenzo.pieralisi, bharat.bhushan, bauerman

On Tue, Apr 30, 2019 at 03:42:14PM +0100, Jean-Philippe Brucker wrote:
> On 30/04/2019 14:56, Jean-Philippe Brucker wrote:
> > The IOMMU device allows a guest to manage DMA mappings for physical,
> > emulated and paravirtualized endpoints. Add device description for the
> > virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
> > UNMAP requests, as well as translation error reporting.
> > 
> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> I'd like to request a vote for this version (the github issue is now up
> to date and the patch applies cleanly)
> 
> Thanks,
> Jean

I sent some comments that might be worth addressing before the vote.  If
you don't agree I'm fine with starting voting now if you like and we'll
let the TC decide, pls let me know.

> > ---
> > Since v2 I rebased onto virtio v1.1-wd02, fixing a conflict in
> > conformance.tex and using the new \conformance command.
> > 
> > A PDF version is available at
> > https://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.11-draft.pdf
> > ---
> >  conformance.tex  |  40 ++-
> >  content.tex      |   1 +
> >  virtio-iommu.tex | 850 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 889 insertions(+), 2 deletions(-)
> >  create mode 100644 virtio-iommu.tex
> > 
> > diff --git a/conformance.tex b/conformance.tex
> > index 42f702a..79a3e7d 100644
> > --- a/conformance.tex
> > +++ b/conformance.tex
> > @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
> >    \begin{itemize}
> >      \item Clause \ref{sec:Conformance / Driver Conformance}.
> >      \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver Conformance}.
> > -    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance}.
> > +    \item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}.
> >      \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
> >    \end{itemize}
> >  \item[Device] A device MUST conform to four conformance clauses:
> >    \begin{itemize}
> >      \item Clause \ref{sec:Conformance / Device Conformance}.
> >      \item One of clauses \ref{sec:Conformance / Device Conformance / PCI Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device Conformance}.
> > -    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device Conformance}.
> > +    \item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance}, \ref{sec:Conformance / Device Conformance / Socket Device Conformance} or \ref{sec:Conformance / Device Conformance / IOMMU Device Conformance}.
> >      \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.
> >    \end{itemize}
> >  \end{description}
> > @@ -183,6 +183,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
> >  \item \ref{drivernormative:Device Types / Socket Device / Device Operation / Device Events}
> >  \end{itemize}
> >  
> > +\conformance{\subsection}{IOMMU Driver Conformance}\label{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}
> > +
> > +An IOMMU driver MUST conform to the following normative statements:
> > +
> > +\begin{itemize}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Feature bits}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device configuration layout}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device Initialization}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / ATTACH request}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / DETACH request}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / MAP request}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / UNMAP request}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE request}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> > +\item \ref{drivernormative:Device Types / IOMMU Device / Device operations / Fault reporting}
> > +\end{itemize}
> > +
> >  \conformance{\section}{Device Conformance}\label{sec:Conformance / Device Conformance}
> >  
> >  A device MUST conform to the following normative statements:
> > @@ -336,6 +354,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
> >  \item \ref{devicenormative:Device Types / Socket Device / Device Operation / Receive and Transmit}
> >  \end{itemize}
> >  
> > +\conformance{\subsection}{IOMMU Device Conformance}\label{sec:Conformance / Device Conformance / IOMMU Device Conformance}
> > +
> > +An IOMMU device MUST conform to the following normative statements:
> > +
> > +\begin{itemize}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / MAP request}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / UNMAP request}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE request}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> > +\item \ref{devicenormative:Device Types / IOMMU Device / Device operations / Fault reporting}
> > +\end{itemize}
> > +
> >  \conformance{\section}{Legacy Interface: Transitional Device and Transitional Driver Conformance}\label{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}
> >  A conformant implementation MUST be either transitional or
> >  non-transitional, see \ref{intro:Legacy
> > diff --git a/content.tex b/content.tex
> > index 193b6e1..5449a46 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -5594,6 +5594,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  \input{virtio-input.tex}
> >  \input{virtio-crypto.tex}
> >  \input{virtio-vsock.tex}
> > +\input{virtio-iommu.tex}
> >  
> >  \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> >  
> > diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> > new file mode 100644
> > index 0000000..be3dcd3
> > --- /dev/null
> > +++ b/virtio-iommu.tex
> > @@ -0,0 +1,850 @@
> > +\section{IOMMU device}\label{sec:Device Types / IOMMU Device}
> > +
> > +The virtio-iommu device manages Direct Memory Access (DMA) from one or
> > +more endpoints. It may act both as a proxy for physical IOMMUs managing
> > +devices assigned to the guest, and as virtual IOMMU managing emulated and
> > +paravirtualized devices.
> > +
> > +The driver first discovers endpoints managed by the virtio-iommu device
> > +using standard firmware mechanisms. It then sends requests to create
> > +virtual address spaces and virtual-to-physical mappings for these
> > +endpoints. In its simplest form, the virtio-iommu supports four request
> > +types:
> > +
> > +\begin{enumerate}
> > +\item Create a domain and attach an endpoint to it.  \\
> > +  \texttt{attach(endpoint = 0x8, domain = 1)}
> > +\item Create a mapping between a range of guest-virtual and guest-physical
> > +  address. \\
> > +  \texttt{map(domain = 1, virt_start = 0x1000, virt_end = 0x1fff,
> > +          phys = 0xa000, flags = READ)}
> > +
> > +  Endpoint 0x8, for example a hardware PCI endpoint with BDF 00:01.0, can
> > +  now read at addresses 0x1000-0x1fff. These accesses are translated
> > +  into system-physical addresses by the IOMMU.
> > +
> > +\item Remove the mapping.\\
> > +  \texttt{unmap(domain = 1, virt_start = 0x1000, virt_end = 0x1fff)}
> > +
> > +  Any access to addresses 0x1000-0x1fff by endpoint 0x8 would now be
> > +  rejected.
> > +\item Detach the device and remove the domain.\\
> > +  \texttt{detach(endpoint = 0x8, domain = 1)}
> > +\end{enumerate}
> > +
> > +\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}
> > +
> > +23
> > +
> > +\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}
> > +
> > +\begin{description}
> > +\item[0] requestq
> > +\item[1] eventq
> > +\end{description}
> > +
> > +\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}
> > +
> > +\begin{description}
> > +\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
> > +  Available range of virtual addresses is described in \field{input_range}
> > +
> > +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
> > +  The number of domains supported is described in \field{domain_bits}
> > +
> > +\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
> > +  Map and unmap requests are available.\footnote{Future extensions may add
> > +  different modes of operations. At the moment, only
> > +  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
> > +
> > +\item[VIRTIO_IOMMU_F_BYPASS (3)]
> > +  When not attached to a domain, endpoints downstream of the IOMMU
> > +  can access the guest-physical address space.
> > +
> > +\item[VIRTIO_IOMMU_F_PROBE (4)]
> > +  The PROBE request is available.
> > +\end{description}
> > +
> > +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> > +
> > +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
> > +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
> > +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.
> > +
> > +\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> > +
> > +If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
> > +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_PROBE or
> > +VIRTIO_IOMMU_F_MAP_UNMAP feature bits, and if the driver did not accept
> > +this feature bit, then the device MAY signal failure by failing to set
> > +FEATURES_OK \field{device status} bit when the driver writes it.
> > +
> > +\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}
> > +
> > +The \field{page_size_mask} field is always present. Availability of the
> > +others depend on various feature bits as indicated above.
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_config {
> > +  u64 page_size_mask;
> > +  struct virtio_iommu_range {
> > +    u64 start;
> > +    u64 end;
> > +  } input_range;
> > +  u8  domain_bits;
> > +  u8  padding[3];
> > +  u32 probe_size;
> > +};
> > +\end{lstlisting}
> > +
> > +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
> > +
> > +The driver MUST NOT write to device configuration fields.
> > +
> > +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}
> > +
> > +The device SHOULD set \field{padding} to zero.
> > +
> > +The device MUST set at least one bit in \field{page_size_mask}, describing
> > +the page granularity. The device MAY set more than one bit in
> > +\field{page_size_mask}.
> > +
> > +\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}
> > +
> > +When the device is reset, endpoints are not attached to any domain.
> > +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
> > +access guest-physical addresses ("bypass mode"). If the feature is not
> > +negotiated, then any memory access from endpoints will fault. Upon
> > +attaching an endpoint in bypass mode to a new domain, any memory access
> > +from the endpoint will fault, since the domain does not contain any
> > +mapping.
> > +
> > +The driver chooses operating mode depending on its capabilities. In this
> > +version of the virtio-iommu device, the only supported mode is
> > +VIRTIO_IOMMU_F_MAP_UNMAP.
> > +
> > +\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
> > +
> > +The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
> > +of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.
> > +
> > +If the VIRTIO_IOMMU_F_PROBE feature is negotiated, the driver SHOULD send a
> > +VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
> > +endpoint to a domain.
> > +
> > +\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}
> > +
> > +If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
> > +device SHOULD NOT let endpoints access the guest-physical address space.
> > +
> > +\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}
> > +
> > +Driver send requests on the request virtqueue, notifies the device and
> > +waits for the device to return the request with a status in the used ring.
> > +All requests are split in two parts: one device-readable, one device-
> > +writable.
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_req_head {
> > +  u8   type;
> > +  u8   reserved[3];
> > +};
> > +
> > +struct virtio_iommu_req_tail {
> > +  u8   status;
> > +  u8   reserved[3];
> > +};
> > +\end{lstlisting}
> > +
> > +Type may be one of:
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_IOMMU_T_ATTACH     1
> > +#define VIRTIO_IOMMU_T_DETACH     2
> > +#define VIRTIO_IOMMU_T_MAP        3
> > +#define VIRTIO_IOMMU_T_UNMAP      4
> > +#define VIRTIO_IOMMU_T_PROBE      5
> > +\end{lstlisting}
> > +
> > +A few general-purpose status codes are defined here. Unless explicitly
> > +described in a \textbf{Requirements} section, these values are hints to
> > +make troubleshooting easier.
> > +
> > +When the device fails to parse a request, for instance if a request seems
> > +too small for its type and the device cannot find the tail, then it will
> > +be unable to set \field{status}. In that case, it should return the
> > +buffers without writing in them.
> > +
> > +\begin{lstlisting}
> > +/* All good! Carry on. */
> > +#define VIRTIO_IOMMU_S_OK         0
> > +/* Virtio communication error */
> > +#define VIRTIO_IOMMU_S_IOERR      1
> > +/* Unsupported request */
> > +#define VIRTIO_IOMMU_S_UNSUPP     2
> > +/* Internal device error */
> > +#define VIRTIO_IOMMU_S_DEVERR     3
> > +/* Invalid parameters */
> > +#define VIRTIO_IOMMU_S_INVAL      4
> > +/* Out-of-range parameters */
> > +#define VIRTIO_IOMMU_S_RANGE      5
> > +/* Entry not found */
> > +#define VIRTIO_IOMMU_S_NOENT      6
> > +/* Bad address */
> > +#define VIRTIO_IOMMU_S_FAULT      7
> > +\end{lstlisting}
> > +
> > +Range limits of some request fields are described in the device
> > +configuration:
> > +
> > +\begin{itemize}
> > +\item \field{page_size_mask} contains the bitmask of all page sizes that
> > +  can be mapped. The least significant bit set defines the page
> > +  granularity of IOMMU mappings. Other bits in the mask are hints
> > +  describing page sizes that the IOMMU can merge into a single mapping
> > +  (page blocks).
> > +
> > +  The smallest page granularity supported by the IOMMU is one byte. It is
> > +  legal for the driver to map one byte at a time if bit 0 of
> > +  \field{page_size_mask} is set.
> > +
> > +\item If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is offered,
> > +  \field{domain_bits} contains the number of bits supported in a domain
> > +  ID, the identifier used in most requests. A value of 0 is valid, it
> > +  means that a single domain is supported and endpoints can only be
> > +  attached to domain 0.
> > +
> > +  If the feature is not offered, domain identifiers can use up to 32 bits.
> > +
> > +\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
> > +  \field{input_range} contains the virtual address range that the IOMMU is
> > +  able to translate. Any mapping request to virtual addresses outside of
> > +  this range will fail.
> > +
> > +  If the feature is not offered, virtual mappings span over the whole
> > +  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
> > +\end{itemize}
> > +
> > +\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
> > +
> > +The driver SHOULD set field \field{reserved} of
> > +\verb+struct virtio_iommu_req_head+ to zero.
> > +
> > +When a device returns a complete request in the used queue without having
> > +written to it, the driver SHOULD interpret it as a failure from the device
> > +to parse the request.
> > +
> > +If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, the driver SHOULD
> > +NOT send requests with \field{virt_start} less than
> > +\field{input_range.start} or \field{virt_end} greater than
> > +\field{input_range.end}.
> > +
> > +If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is negotiated, the driver SHOULD
> > +NOT send requests with \field{domain} greater than the size described by
> > +\field{domain_bits}.
> > +
> > +The driver SHOULD NOT use multiple descriptor chains for a single request.
> > +
> > +\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
> > +
> > +The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
> > +didn't succeed.
> > +
> > +If a request \field{type} is not recognized, the device SHOULD return the
> > +buffers on the used ring and set the \field{len} field of the used element
> > +to zero.
> > +
> > +The device SHOULD ignore field \field{reserved} of
> > +\verb+struct virtio_iommu_req_head+ and SHOULD set field \field{reserved}
> > +of \verb+struct virtio_iommu_req_tail+ to zero.
> > +
> > +If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated and the range
> > +described by fields \field{virt_start} and \field{virt_end} doesn't fit in
> > +the range described by \field{input_range}, the device MAY set
> > +\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
> > +
> > +If the VIRTIO_IOMMU_F_DOMAIN_BITS is negotiated and bits above
> > +\field{domain_bits} are set in field \field{domain}, the device MAY set
> > +\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
> > +
> > +\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_req_attach {
> > +  struct virtio_iommu_req_head head;
> > +  le32 domain;
> > +  le32 endpoint;
> > +  u8   reserved[8];
> > +  struct virtio_iommu_req_tail tail;
> > +};
> > +\end{lstlisting}
> > +
> > +Attach an endpoint to a domain. \field{domain} is an identifier unique to
> > +the virtio-iommu device. The \field{domain} number doesn't have a meaning
> > +outside of virtio-iommu. If the domain doesn't exist in the device, it is
> > +created. \field{endpoint} is an identifier unique to the virtio-iommu
> > +device. The host communicates these unique endpoint IDs to the guest using
> > +methods outside the scope of this specification, but the following rules
> > +apply:
> > +
> > +\begin{itemize}
> > +\item The endpoint ID is unique from the virtio-iommu point of view.
> > +  Multiple endpoints whose DMA transactions are not translated by the same
> > +  virtio-iommu may have the same endpoint ID. Endpoints whose DMA
> > +  transactions may be translated by the same virtio-iommu must have
> > +  different endpoint IDs.
> > +
> > +\item Sometimes the host cannot completely isolate two endpoints from each
> > +  others. For example on a legacy PCI bus, endpoints can snoop DMA
> > +  transactions from their neighbours. In this case, the host must
> > +  communicate to the guest that it cannot isolate these endpoints from
> > +  each others, or that the physical IOMMU cannot distinguish transactions
> > +  coming from these endpoints. The method used to communicate this is
> > +  outside the scope of this specification.
> > +\end{itemize}
> > +
> > +Multiple endpoints may be attached to the same domain. An endpoint cannot
> > +be attached to multiple domains at the same time.
> > +
> > +\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> > +
> > +The driver SHOULD set \field{reserved} to zero.
> > +
> > +The driver SHOULD ensure that endpoints that cannot be isolated by the
> > +host are attached to the same domain.
> > +
> > +\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> > +
> > +If the \field{reserved} field of an ATTACH request is not zero, the device
> > +SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
> > +NOT attach the endpoint to the domain. \footnote{The device should
> > +validate input of ATTACH requests in case the driver attempts to attach in
> > +a mode that is unimplemented by the device, and would be incompatible with
> > +the modes implemented by the device.}
> > +
> > +If the endpoint identified by \field{endpoint} doesn't exist, then the
> > +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> > +
> > +If another endpoint is already attached to the domain identified by
> > +\field{domain}, then the device MAY attach the endpoint identified by
> > +\field{endpoint} to the domain. If it cannot do so, the device
> > +MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> > +
> > +If the endpoint identified by \field{endpoint} is already attached to
> > +another domain, then the device SHOULD first detach it from that domain
> > +and attach it to the one identified by \field{domain}. In that case the
> > +device behaves as if the driver issued a DETACH request with this
> > +\field{endpoint}, followed by the ATTACH request. If the device cannot do
> > +so, it MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> > +
> > +If properties of the endpoint (obtained with a PROBE request) are
> > +incompatible with properties of other endpoints already attached to the
> > +requested domain, the device MAY attach the endpoint. If it cannot do so, the
> > +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> > +\footnote{In general it is simpler and safer to reject attach when two devices
> > +have differing values in a property, for example two reserved regions of
> > +different types that would overlap. Depending on the property, device
> > +implementation can try to merge them and accept the attach.}
> > +
> > +\subsubsection{DETACH request}
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_req_detach {
> > +  struct virtio_iommu_req_head head;
> > +  le32 domain;
> > +  le32 endpoint;
> > +  u8   reserved[8];
> > +  struct virtio_iommu_req_tail tail;
> > +};
> > +\end{lstlisting}
> > +
> > +Detach an endpoint from a domain. When this request completes, the
> > +endpoint cannot access any mapping from that domain anymore. If feature
> > +VIRTIO_IOMMU_F_BYPASS has been negotiated, then the endpoint accesses the
> > +guest-physical address space once this request completes.
> > +
> > +After all endpoints have been successfully detached from a domain, it
> > +ceases to exist and its ID can be reused by the driver for another domain.
> > +
> > +\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
> > +
> > +The driver SHOULD set \field{reserved} to zero.
> > +
> > +\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}
> > +
> > +If the \field{reserved} field of a DETACH request is not zero, the device
> > +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
> > +the device MAY still perform the DETACH operation.
> > +
> > +If the endpoint identified by \field{endpoint} doesn't exist, then the
> > +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> > +
> > +If the domain identified by \field{domain} doesn't exist, or if the
> > +endpoint identified by \field{endpoint} isn't attached to this domain,
> > +then the device MAY set the request \field{status} to
> > +VIRTIO_IOMMU_S_INVAL.
> > +
> > +The device MUST ensure that after being detached from a domain, the
> > +endpoint cannot access any mapping from that domain.
> > +
> > +\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_req_map {
> > +  struct virtio_iommu_req_head head;
> > +  le32  domain;
> > +  le64  virt_start;
> > +  le64  virt_end;
> > +  le64  phys_start;
> > +  le32  flags;
> > +  struct virtio_iommu_req_tail tail;
> > +};
> > +
> > +/* Flags are: */
> > +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
> > +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
> > +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
> > +#define VIRTIO_IOMMU_MAP_F_MMIO   (1 << 3)
> > +\end{lstlisting}
> > +
> > +Map a range of virtually-contiguous addresses to a range of
> > +physically-contiguous addresses of the same size. After the request
> > +succeeds, all endpoints attached to this domain can access memory in the
> > +range $[virt\_start; virt\_end]$ (inclusive). For example, if an endpoint
> > +accesses address $VA \in [virt\_start; virt\_end]$, the device (or the
> > +physical IOMMU) translates the address: $PA = VA - virt\_start +
> > +phys\_start$. If the access parameters are compatible with \field{flags}
> > +(for instance, the access is write and \field{flags} are
> > +VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
> > +the access to reach $PA$.
> > +
> > +The range defined by \field{virt_start} and \field{virt_end} should be
> > +within the limits specified by \field{input_range}. Given $phys\_end =
> > +phys\_start + virt\_end - virt\_start$, the range defined by
> > +\field{phys_start} and phys_end should be within the guest-physical
> > +address space. This includes upper and lower limits, as well as any
> > +carving of guest-physical addresses for use by the host. Guest physical
> > +boundaries are set by the host using a firmware mechanism outside the
> > +scope of this specification.
> > +
> > +Availability and allowed combinations of \field{flags} depend on the
> > +underlying IOMMU architectures. VIRTIO_IOMMU_MAP_F_READ and
> > +VIRTIO_IOMMU_MAP_F_WRITE are usually implemented, although READ is
> > +sometimes implied by WRITE. VIRTIO_IOMMU_MAP_F_EXEC might not be
> > +available. In addition combinations such as "WRITE and not READ" or "WRITE
> > +and EXEC" might not be supported.
> > +
> > +The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a protection
> > +flag. It may be used, for example, to map Message Signaled Interrupt
> > +doorbells when a VIRTIO_IOMMU_RESV_MEM_T_MSI region isn't available. To
> > +trigger interrupts the endpoint performs a direct memory write to another
> > +peripheral, the IRQ chip. Since it is a signal, the write must not be
> > +buffered, elided, or combined with other writes by the memory
> > +interconnect. The precise meaning of the MMIO flag depends on the
> > +underlying memory architecture (for example on Armv8-A it corresponds to
> > +the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
> > +isn't required to support the MMIO flag.
> > +
> > +This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
> > +negotiated.
> > +
> > +\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
> > +
> > +The driver SHOULD set undefined \field{flags} bits to zero.
> > +
> > +\field{virt_end} MUST be strictly greater than \field{virt_start}.
> > +
> > +The driver SHOULD set the VIRTIO_IOMMU_MAP_F_MMIO flag when the physical
> > +range corresponds to memory-mapped device registers. The physical range
> > +SHOULD have a single memory type: either normal memory or memory-mapped
> > +I/O.
> > +
> > +\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
> > +
> > +If \field{virt_start}, \field{phys_start} or (\field{virt_end} + 1) is
> > +not aligned on the page granularity, the device SHOULD set the request
> > +\field{status} to VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.
> > +
> > +If a mapping already exists in the requested range, the device SHOULD set
> > +the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD NOT change
> > +any mapping.
> > +
> > +If the device doesn't recognize a \field{flags} bit, it SHOULD set the
> > +request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
> > +SHOULD NOT create the mapping. \footnote{Validating the input is important
> > +here, because the driver might be attempting to map with special flags
> > +that the device doesn't recognize. Creating the mapping with incompatible
> > +flags may result in loss of coherency and security hazards.}
> > +
> > +If a flag or combination of flag isn't supported, the device MAY set the
> > +request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
> > +
> > +The device MUST NOT allow writes to a range mapped without the
> > +VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
> > +does not support write-only mappings, the device MAY allow reads to a
> > +range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
> > +VIRTIO_IOMMU_MAP_F_READ.
> > +
> > +If \field{domain} does not exist, the device SHOULD set the request
> > +\field{status} to VIRTIO_IOMMU_S_NOENT.
> > +
> > +\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_req_unmap {
> > +  struct virtio_iommu_req_head head;
> > +  le32  domain;
> > +  le64  virt_start;
> > +  le64  virt_end;
> > +  u8    reserved[4];
> > +  struct virtio_iommu_req_tail tail;
> > +};
> > +\end{lstlisting}
> > +
> > +Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
> > +a mapping as a virtual region created with a single MAP request. All
> > +mappings covered by the range $[virt\_start; virt\_end]$ (inclusive) are
> > +removed.
> > +
> > +The semantics of unmapping are specified in \ref{drivernormative:Device
> > +Types / IOMMU Device / Device operations / UNMAP request} and
> > +\ref{devicenormative:Device Types / IOMMU Device / Device operations /
> > +UNMAP request}, and illustrated with the following requests, assuming each
> > +example sequence starts with a blank address space. We define two
> > +pseudocode functions \texttt{map(virt_start, virt_end) -> mapping} and
> > +\texttt{unmap(virt_start, virt_end)}.
> > +
> > +\begin{lstlisting}
> > +(1) unmap(virt_start=0,
> > +          virt_end=4)            -> succeeds, doesn't unmap anything
> > +
> > +(2) a = map(virt_start=0,
> > +            virt_end=9);
> > +    unmap(0, 9)                  -> succeeds, unmaps a
> > +
> > +(3) a = map(0, 4);
> > +    b = map(5, 9);
> > +    unmap(0, 9)                  -> succeeds, unmaps a and b
> > +
> > +(4) a = map(0, 9);
> > +    unmap(0, 4)                  -> faults, doesn't unmap anything
> > +
> > +(5) a = map(0, 4);
> > +    b = map(5, 9);
> > +    unmap(0, 4)                  -> succeeds, unmaps a
> > +
> > +(6) a = map(0, 4);
> > +    unmap(0, 9)                  -> succeeds, unmaps a
> > +
> > +(7) a = map(0, 4);
> > +    b = map(10, 14);
> > +    unmap(0, 14)                 -> succeeds, unmaps a and b
> > +\end{lstlisting}
> > +
> > +This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
> > +negotiated.
> > +
> > +\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
> > +
> > +The driver SHOULD set the \field{reserved} field to zero.
> > +
> > +The range, defined by \field{virt_start} and \field{virt_end}, SHOULD
> > +cover one or more contiguous mappings created with MAP requests. The range
> > +MAY spill over unmapped virtual addresses.
> > +
> > +The first address of a range SHOULD either be the first address of a
> > +mapping or be outside any mapping. The last address of a range SHOULD
> > +either be the last address of a mapping or be outside any mapping.
> > +
> > +\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}
> > +
> > +If the \field{reserved} field of an UNMAP request is not zero, the device
> > +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
> > +the device MAY perform the UNMAP operation.
> > +
> > +If \field{domain} does not exist, the device SHOULD set the request
> > +\field{status} to VIRTIO_IOMMU_S_NOENT.
> > +
> > +If a mapping affected by the range is not covered in its entirety by the
> > +range (the UNMAP request would split the mapping), then the device SHOULD
> > +set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
> > +remove any mapping.
> > +
> > +If part of the range or the full range is not covered by an existing
> > +mapping, then the device SHOULD remove all mappings affected by the range
> > +and set the request \field{status} to VIRTIO_IOMMU_S_OK.
> > +
> > +\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}
> > +
> > +If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
> > +VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
> > +device manages. This probe is performed before attaching the endpoint to
> > +a domain.
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_req_probe {
> > +  struct virtio_iommu_req_head head;
> > +  /* Device-readable */
> > +  le32  endpoint;
> > +  u8    reserved[64];
> > +
> > +  /* Device-writable */
> > +  u8    properties[probe_size];
> > +  struct virtio_iommu_req_tail tail;
> > +};
> > +\end{lstlisting}
> > +
> > +\begin{description}
> > +\item[\field{endpoint}] has the same meaning as in ATTACH and DETACH
> > +  requests.
> > +
> > +\item[\field{reserved}] is used as padding, so that future extensions can
> > +  add fields to the device-readable part.
> > +
> > +\item[\field{properties}] contains a list of properties of the
> > +  \field{endpoint}, filled by the device. The length of the
> > +  \field{properties} field is \field{probe_size} bytes. Each property is
> > +  described with a \verb+struct virtio_iommu_probe_property+ header, which
> > +  may be followed by a value of size \field{length}.
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_IOMMU_PROBE_T_MASK 0xfff
> > +
> > +struct virtio_iommu_probe_property {
> > +  le16  type;
> > +  le16  length;
> > +};
> > +\end{lstlisting}
> > +
> > +\end{description}
> > +
> > +The driver allocates a buffer of adequate size for the probe request,
> > +writes \field{endpoint} and adds the buffer to the request queue. The
> > +device fills the \field{properties} field with a list of properties for
> > +this endpoint.
> > +
> > +The driver parses the first property by reading \field{type}, then
> > +\field{length}. If the driver recognizes \field{type}, it reads and
> > +handles the rest of the property. The driver then reads the next property,
> > +that is located $(\field{length} + 4)$ bytes after the beginning of the
> > +first one, and so on. The driver parses all properties until it reaches a
> > +NONE property or the end of \field{properties}.
> > +
> > +The upper nibble of \field{type} is reserved for future extensions.
> > +Therefore only 4096 types are available. The actual type of a property is
> > +extracted like this:
> > +
> > +\begin{lstlisting}
> > +u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_T_MASK;
> > +\end{lstlisting}
> > +
> > +Available property types are described in section
> > +\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
> > +
> > +\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
> > +
> > +The size of \field{properties} MUST be \field{probe_size} bytes.
> > +
> > +The driver SHOULD set \field{reserved} to zero.
> > +
> > +If the driver doesn't recognize the \field{type} of a property, it SHOULD
> > +ignore the property and continue parsing the list.
> > +
> > +The driver SHOULD NOT deduce the property length from \field{type}.
> > +
> > +The driver SHOULD ignore bits[15:12] of \field{type}.
> > +
> > +\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}
> > +
> > +If the \field{reserved} field of a PROBE request is not zero, the device
> > +MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.
> > +
> > +If the endpoint identified by \field{endpoint} doesn't exist, then the
> > +device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.
> > +
> > +If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
> > +driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
> > +the buffers on the used ring and set the \field{len} field of the used
> > +element to zero.
> > +
> > +The device SHOULD set bits [15:12] of property \field{type} to zero.
> > +
> > +The device MUST write the size of the property without the
> > +\verb+struct virtio_iommu_probe_property+ header, in bytes, into
> > +\field{length}.
> > +
> > +When two properties follow each others, the device MUST put the second
> > +property exactly $(\field{length} + 4)$ bytes after the beginning of the
> > +first one.
> > +
> > +If the \field{properties} list is smaller than \field{probe_size}, then
> > +the device SHOULD NOT write any property and SHOULD set the request
> > +\field{status} to VIRTIO_IOMMU_S_INVAL.
> > +
> > +If the device doesn't fill all \field{probe_size} bytes with properties,
> > +it SHOULD fill the remaining bytes of \field{properties} with zeroes.
> > +
> > +\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_IOMMU_PROBE_T_NONE       0
> > +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
> > +\end{lstlisting}
> > +
> > +\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}
> > +
> > +Marks the end of the property list. This property doesn't have any value,
> > +and should have \field{length} 0.
> > +
> > +\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> > +
> > +The RESV_MEM property describes a chunk of reserved virtual memory. It may
> > +be used by the device to describe virtual address ranges that shouldn't be
> > +allocated by the driver, or that are special.
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_probe_resv_mem {
> > +  struct virtio_iommu_probe_property head;
> > +  u8    subtype;
> > +  u8    reserved[3];
> > +  le64  start;
> > +  le64  end;
> > +};
> > +\end{lstlisting}
> > +
> > +Fields \field{start} and \field{end} describe the range of reserved virtual
> > +addresses. \field{subtype} may be one of:
> > +
> > +\begin{description}
> > +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> > +    Accesses to virtual addresses in this region have undefined behavior.
> > +    They may be aborted by the device, bypass it, or never even reach it.
> > +    The region may also be used for host mappings, for example Message
> > +    Signaled Interrupts.
> > +
> > +    The guest should neither use these virtual addresses in a MAP request
> > +    nor instruct endpoints to perform DMA on them.
> > +
> > +  \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
> > +    This region is a doorbell for Message Signaled Interrupts (MSIs). It
> > +    is similar to VIRTIO_IOMMU_RESV_MEM_T_RESERVED, in that the driver
> > +    should not map virtual addresses described by the property.
> > +
> > +    In addition it tells the guest how to handle MSI doorbells. If the
> > +    endpoint doesn't have a VIRTIO_IOMMU_RESV_MEM_T_MSI property
> > +    corresponding to the doorbell of a virtual MSI controller, then the
> > +    guest should create a mapping for it.
> > +\end{description}
> > +
> > +\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> > +
> > +The driver SHOULD NOT map any virtual address described by a
> > +VIRTIO_IOMMU_RESV_MEM_T_RESERVED or VIRTIO_IOMMU_RESV_MEM_T_MSI property.
> > +
> > +The driver SHOULD ignore \field{reserved}.
> > +
> > +The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
> > +was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.
> > +
> > +\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> > +
> > +The device SHOULD set \field{reserved} to zero.
> > +
> > +The device SHOULD NOT present more than one VIRTIO_IOMMU_RESV_MEM_T_MSI
> > +property per endpoint.
> > +
> > +The device SHOULD NOT present RESV_MEM properties that overlap each others
> > +for the same endpoint.
> > +
> > +\subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting}
> > +
> > +The device can report translation faults and other significant asynchronous
> > +events on the event virtqueue. The driver initially populates the queue with
> > +empty report buffers. When the device needs to report an event, it fills a
> > +buffer and notifies the driver with an interrupt. The driver consumes the
> > +report and moves the buffer back onto the queue.
> > +
> > +If no buffer is available, the device may either wait for one to be consumed,
> > +or drop the event.
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_fault {
> > +  u8    reason;
> > +  u8    reserved[3];
> > +  le32  flags;
> > +  le32  endpoint;
> > +  le32  reserved1;
> > +  le64  address;
> > +};
> > +
> > +#define VIRTIO_IOMMU_FAULT_F_READ     (1 << 0)
> > +#define VIRTIO_IOMMU_FAULT_F_WRITE    (1 << 1)
> > +#define VIRTIO_IOMMU_FAULT_F_EXEC     (1 << 2)
> > +#define VIRTIO_IOMMU_FAULT_F_ADDRESS  (1 << 8)
> > +\end{lstlisting}
> > +
> > +\begin{description}
> > +  \item[\field{reason}] The reason for this report. It may have the
> > +    following values:
> > +    \begin{description}
> > +      \item[VIRTIO_IOMMU_FAULT_R_UNKNOWN (0)] An internal error happened, or
> > +        an error that cannot be described with the following reasons.
> > +      \item[VIRTIO_IOMMU_FAULT_R_DOMAIN (1)] The endpoint attempted to
> > +        access \field{address} without being attached to a domain.
> > +      \item[VIRTIO_IOMMU_FAULT_R_MAPPING (2)] The endpoint attempted to
> > +        access \field{address}, which wasn't mapped in the domain or
> > +        didn't have the correct protection flags.
> > +    \end{description}
> > +  \item[\field{flags}] Information about the fault context.
> > +  \item[\field{endpoint}] The endpoint causing the fault.
> > +  \item[\field{reserved} and \field{reserved1}] Should be zero.
> > +  \item[\field{address}] If VIRTIO_IOMMU_FAULT_F_ADDRESS is set, the
> > +    address causing the fault.
> > +\end{description}
> > +
> > +These faults are not recoverable\footnote{This means that the PRI
> > +extension to PCI, for example, that allows recoverable faults, isn't
> > +supported for the moment.}. The guest has to do its best to
> > +prevent any future fault from happening, by stopping or resetting the
> > +endpoint.
> > +
> > +When the fault is reported by a physical IOMMU, the fault reasons may not
> > +match exactly the reason of the original fault report. The device should
> > +try its best to find the closest match.
> > +
> > +If the device encounters a fault that wasn't caused by a specific
> > +endpoint, it is unlikely that the driver would be able to do anything else
> > +than print the fault and stop using the device, so reporting the fault on
> > +the event queue isn't useful. In that case, we recommend using the
> > +DEVICE_NEEDS_RESET status bit.
> > +
> > +\drivernormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
> > +
> > +If the \field{reserved} field is not zero, the driver SHOULD ignore the
> > +fault report.\footnote{A future format may implement events that are not
> > +faults, which would be differentiated by a type field in place of
> > +\field{reserved}.}
> > +
> > +The driver SHOULD ignore undefined \field{flags}.
> > +
> > +If the driver doesn't recognize \field{reason}, it SHOULD treat the fault
> > +as if it was VIRTIO_IOMMU_FAULT_R_UNKNOWN.
> > +
> > +\devicenormative{\paragraph}{Fault reporting}{Device Types / IOMMU Device / Device operations / Fault reporting}
> > +
> > +The device SHOULD set \field{reserved} and \field{reserved1} to zero.
> > +
> > +The device SHOULD set undefined \field{flags} to zero.
> > +
> > +The device SHOULD write a valid endpoint ID in \field{endpoint}.
> > +
> > +The device MAY omit setting VIRTIO_IOMMU_FAULT_F_ADDRESS and writing
> > +\field{address} in any fault report, regardless of the \field{reason}.
> > +
> > +If a buffer is too small to contain the fault report\footnotemark, the
> > +device SHOULD NOT use multiple buffers to describe it. The device MAY fall
> > +back to using an older fault report format that fits in the buffer.
> > +
> > +\footnotetext{This would happen for example if the device implements a
> > +more recent version of this specification, whose fault report contains
> > +additional fields.}
> > 


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification
  2019-05-03 19:47   ` Michael S. Tsirkin
@ 2019-05-05 13:33     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-05-05 13:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	kevin.tian, Lorenzo Pieralisi, bharat.bhushan, bauerman

On 03/05/2019 20:47, Michael S. Tsirkin wrote:
> On Tue, Apr 30, 2019 at 03:42:14PM +0100, Jean-Philippe Brucker wrote:
>> On 30/04/2019 14:56, Jean-Philippe Brucker wrote:
>> > The IOMMU device allows a guest to manage DMA mappings for physical,
>> > emulated and paravirtualized endpoints. Add device description for the
>> > virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
>> > UNMAP requests, as well as translation error reporting.
>> > 
>> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
>> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> 
>> I'd like to request a vote for this version (the github issue is now up
>> to date and the patch applies cleanly)
>> 
>> Thanks,
>> Jean
> 
> I sent some comments that might be worth addressing before the vote.  If
> you don't agree I'm fine with starting voting now if you like and we'll
> let the TC decide, pls let me know.

Thanks for the review, I agree with most comments. I'll reply this week
and then send a new version, so you can hold off the vote

Thanks,
Jean

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
  2019-05-03 17:05 ` [virtio-comment] " Michael S. Tsirkin
@ 2019-05-10 12:07   ` Jean-Philippe Brucker
  2019-05-10 16:52     ` Michael S. Tsirkin
  0 siblings, 1 reply; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-05-10 12:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	kevin.tian, lorenzo.pieralisi, bharat.bhushan, bauerman

Thanks for taking the time to review it! For brevity I only replied to
questions and the bits I don't completely agree with, I will change the
rest.

On 03/05/2019 18:05, Michael S. Tsirkin wrote:
> On Tue, Apr 30, 2019 at 02:56:48PM +0100, Jean-Philippe Brucker wrote:
>> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
[...]
>> +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
>> +  The number of domains supported is described in \field{domain_bits}
> 
> I would rename this to domain_range or something like this.
> E.g. there's an option of 0 which is special.

I don't see a use for it right now, but I can make this a range.

>> +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
>> +
>> +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
>> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
>> +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.
> 
> Why does it have to accept MAP_UNMAP if it does not want to use it?

I was thinking a future version that adds another mode would remove this
statement. I can remove it, it's clear enough that currently the driver
won't get anything if it doesn't accept the bit.

>> +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
>> +access guest-physical addresses ("bypass mode").
> 
> I think this means actually "all unattached endpoints"?
> 
> 
> So how does this interact with e.g. encrypted memory?
> I think it's actually weaker. This bit IMHO just means that
> all accesses are allowed by the iommu and all addresses
> are translated 1:1.

Yes that's what it means, I'll try to clarify the description. I haven't
looked at memory encryption yet, I guess it's disabled until you attach
a domain, and then you need to pass keys in an extended MAP request.

> Also what about other endpoints? With bypass
> does an endpoint not in any domain get access to all other
> endpoints?

Note that it's guest-physical addresses, not physical memory. If other
devices are memory-mapped, then with bypass an endpoint can access them.

>> If the feature is not
>> +negotiated, then any memory access from endpoints will fault.
> 
> fault -> fail?
> 
> So all accesses are disallowed, right?

Yes (Minus reserved regions in some cases. More below)

I rewrote the paragraph as:
"If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from
unattached endpoints are allowed and translated by the IOMMU using the
identity function. If the feature is not negotiated, any memory access
from an unattached endpoint will fail. Upon attaching an endpoint in
bypass mode to a new domain, any memory access from the endpoint fails,
since the domain does not contain any mapping."

>> +
>> +\begin{itemize}
>> +\item \field{page_size_mask} contains the bitmask of all page sizes that
>> +  can be mapped. The least significant bit set defines the page
>> +  granularity of IOMMU mappings.
> 
> so this refers to map/unmap requests only really?

Yes

>> Other bits in the mask are hints
>> +  describing page sizes that the IOMMU can merge into a single mapping
>> +  (page blocks).
> 
> what does this merging refer to? I don't find it anywhere in the spec.

It doesn't refer to anything else in the spec, the current wording isn't
great. When the host stores mappings in page tables, it's useful for the
guest to know which huge page sizes are available. For example if the
physical IOMMU can map 2M and 1G blocks in addition to 4k pages, then
the mask is 0x40201000. The hints allow a guest to optimize DMA
allocation (iommu_dma_alloc() in Linux), in order to minimize the number
of IOTLB entries used in hardware.

I now have the following:
"Other bits in \field{page_size_mask} are hints and describe larger page
sizes that the IOMMU device handles efficiently. For example, when the
device stores mappings using a page table tree, it may be able to
describe large mappings using a few leaf entries in intermediate tables,
rather than using lots of entries in the last level of the tree.
Creating mappings aligned on large page size can improve performance
since they require fewer page tables and TLB entries."

>> +  The smallest page granularity supported by the IOMMU is one byte. It is
>> +  legal for the driver to map one byte at a time if bit 0 of
>> +  \field{page_size_mask} is set.
> 
> With single byte mappings, can't guest quickly exhaust
> device memory? what guidance can we provide for handling
> this kind of issue? how do device and driver avoid it?

The advantage of a single-byte granularity is arbitrary mapping sizes,
it doesn't mean that the driver will use any single-byte mapping. If the
host uses page tables to store mappings then the granularity is
generally 4kB. But if it uses interval trees then it can have
arbitrary mapping sizes, for example the guest could map a packet with a
1500-byte mapping.

For the moment the best way to deal with memory exhaustion might be to
return a "out of memory" status in the MAP request (more below).

>> +\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
>> +
>> +\begin{lstlisting}
>> +struct virtio_iommu_req_attach {
>> +  struct virtio_iommu_req_head head;
>> +  le32 domain;
>> +  le32 endpoint;
>> +  u8   reserved[8];
>> +  struct virtio_iommu_req_tail tail;
>> +};
>> +\end{lstlisting}
>> +
>> +Attach an endpoint to a domain. \field{domain} is an identifier unique to
>> +the virtio-iommu device.
> 
> what does this mean? do you mean that it uniquely identifies
> a domain? 

It uniquely identifies a domain for one device. When there are multiple
virtio-iommu devices, the driver can maintain one domain ID space per
device rather than a global ID space.

> And I guess the assumption is that endpoints in different domains
> are isolated from each other?

Yes, I'll add this.

>> +\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
>> +
>> +If the \field{reserved} field of an ATTACH request is not zero, the device
>> +SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
>> +NOT attach the endpoint to the domain. \footnote{The device should
> 
> what does should mean here? Is expected to? Or is this part of
> the conformance clause? the SHOULD and take it out of the footnote.
> 
>> +validate input of ATTACH requests in case the driver attempts to attach in
>> +a mode that is unimplemented by the device, and would be incompatible with
>> +the modes implemented by the device.}
> 
> Where does mode come from?

It would be a new feature bit. I don't think the footnote is useful, so
I'll just specify that the device rejects the request if reserved != 0.

>> +\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}
>> +
>> +\begin{lstlisting}
>> +struct virtio_iommu_req_map {
>> +  struct virtio_iommu_req_head head;
>> +  le32  domain;
>> +  le64  virt_start;
>> +  le64  virt_end;
>> +  le64  phys_start;
>> +  le32  flags;
>> +  struct virtio_iommu_req_tail tail;
>> +};
>> +
>> +/* Flags are: */
>> +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
>> +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
>> +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
> 
> what is exec? pls add some comments near all flags.

Exec means instruction fetch. Some systems have different transaction
flags for read and exec (PCI only supports that in the PASID TLP prefix)
and some page tables have a "no exec" bit.

>> The precise meaning of the MMIO flag depends on the
>> +underlying memory architecture (for example on Armv8-A it corresponds to
>> +the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
>> +isn't required to support the MMIO flag.
> 
> I really dislike how this part is written. A platform specific bit that
> does whatever platform wants? How does one use it based on this?
> Platform documentation is hoghly unlikely to explain what does
> VIRTIO_IOMMU_MAP_F_MMIO mean.
> 
> Does this mean "not buffered, elided, or combined with other writes"?
> 
> Also maybe add a feature bit so platforms can declare support.

Right, a feature bit for MMIO is a good idea. I'll add one for EXEC as
well. I'll describe MMIO using only the expected behavior of memory
accesses to the region, which is the same regardless of the architecture.

>> +The device MUST NOT allow writes to a range mapped without the
>> +VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
>> +does not support write-only mappings, the device MAY allow reads to a
>> +range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
>> +VIRTIO_IOMMU_MAP_F_READ.
> 
> do drivers care? if yes how about a way for it to know?

I don't think they care. If write-implies-read was a security problem I
think that architectures (x86, arm64) would have changed it by now.

>> +(4) a = map(0, 9);
>> +    unmap(0, 4)                  -> faults, doesn't unmap anything
> 
> fails?
> 
>> +
>> +(5) a = map(0, 4);
>> +    b = map(5, 9);
>> +    unmap(0, 4)                  -> succeeds, unmaps a
>> +
>> +(6) a = map(0, 4);
>> +    unmap(0, 9)                  -> succeeds, unmaps a
>> +
>> +(7) a = map(0, 4);
>> +    b = map(10, 14);
>> +    unmap(0, 14)                 -> succeeds, unmaps a and b
>> +\end{lstlisting}
>> +
> 
> what about a partial unmap? e.g. unmap 0, 3?

That's not allowed at the moment (example 4 and next-to-last paragraph
of device normative).

>> +If a mapping affected by the range is not covered in its entirety by the
>> +range (the UNMAP request would split the mapping), then the device SHOULD
>> +set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
>> +remove any mapping.
>> +
>> +If part of the range or the full range is not covered by an existing
>> +mapping, then the device SHOULD remove all mappings affected by the range
>> +and set the request \field{status} to VIRTIO_IOMMU_S_OK.
>> +
> 
> 
> So based on above, it seems clear that device must remember any number
> of mappings sent to it in device memory. Seems like an easy way
> for guest to use any amount of device memory.
> 
> If any limits are allowed then I think device needs a way
> to communicate them to driver.

Except for embedded implementations, it seems difficult to determine a
limit upfront, since it's going to depend on the host's memory which
changes at runtime. And even if we do advertise a limit in the config,
the driver is unlikely to change its DMA allocation strategy. It will
still create mappings until it has to return an error to the caller. So
for now the best strategy is probably for the device to return a DEVERR
status when it runs out of mappings.

>> +The driver parses the first property by reading \field{type}, then
>> +\field{length}. If the driver recognizes \field{type}, it reads and
>> +handles the rest of the property. The driver then reads the next property,
>> +that is located $(\field{length} + 4)$ bytes after the beginning of the
>> +first one, and so on. The driver parses all properties until it reaches a
>> +NONE property or the end of \field{properties}.
> 
> I guess device must make sure length fits within the buffer?

Yes, the device knows how much info it needs to communicate, and writes
it in the probe_size field of the config.

>> +\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}
>> +
>> +Marks the end of the property list. This property doesn't have any value,
>> +and should have \field{length} 0.
> 
> why is this needed btw?

Just provides a convenient way to mark the end of the list.

>> +\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
>> +
>> +The RESV_MEM property describes a chunk of reserved virtual memory. It may
>> +be used by the device to describe virtual address ranges that shouldn't be
>> +allocated by the driver, or that are special.
> 
> avoid shouldn't outside conformance clauses. maybe add a conformance
> clause that driver must avoid these ranges?
> what if driver ignores this property? I guess device must validate
> the addresses?

Yes, device returns an error when a MAP request crosses a resv region.

> 
>> +
>> +\begin{lstlisting}
>> +struct virtio_iommu_probe_resv_mem {
>> +  struct virtio_iommu_probe_property head;
>> +  u8    subtype;
>> +  u8    reserved[3];
>> +  le64  start;
>> +  le64  end;
>> +};
>> +\end{lstlisting}
>> +
>> +Fields \field{start} and \field{end} describe the range of reserved virtual
>> +addresses. \field{subtype} may be one of:
>> +
>> +\begin{description}
>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
>> +    Accesses to virtual addresses in this region have undefined behavior.
> 
> undefined?  how can platforms with such a property coexist with untrusted
> devices?

It is "undefined" because I can't enumerate all cases (don't know about
all of them). This would contain for example the RMRR regions of VT-d
(the BIOS reserving some DMA regions for itself), regions allocated for
host-managed MSIs on arm platforms, and some PCI bridge windows
(although they removed those from the Linux resv regions recently, not
certain why). Ideally we wouldn't have any, but some special cases make
it necessary.

In general having reserved regions is incompatible with untrusted
devices, and in some cases incompatible with untrusted guests. But
choosing the policy about which devices to present to guest is up to the
platform. The host knows what the resv regions are used for, and if they
are safe enough. The guest just need to knows what to avoid.

> 
>> +    They may be aborted by the device,
> 
> the device as in the iommu device?

Yes, although I can't think of an example of this case right now.

> 
>> bypass it,
> 
> bypass might be a big security problem.

I have two cases in mind:
* MSI, which is fine since the irqchip provides device isolation,
* mappings reserved by the BIOS, which again only affects the device but
may allow the guest to access firmware information. In my opinion such
endpoint shouldn't be assigned to a guest.

>> or never even reach it.
> 
> that's ok

We could be losing isolation if some bridge is intercepting accesses to
a range of addresses.

>> +    The region may also be used for host mappings, for example Message
>> +    Signaled Interrupts.
> 
> Let's avoid mentioning host if we can.
> 
>> +
>> +    The guest should neither use these virtual addresses in a MAP request
>> +    nor instruct endpoints to perform DMA on them.
> 
> avoid should outside conformance clauses.
> 
> and what it guest does not obey this request?
> 
> I think we must fix this and require that such accesses
> are ignored.

That's generally not something the host can enforce. It has to make sure
that such accesses would only disturb the guest or the endpoint.
Otherwise it had better trust both guest and endpoint, or it shouldn't
present the endpoint to the guest at all.

>> +The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
>> +was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.
> 
> why is that a good idea?

Some future version might add a new reserved type that provides more
details. For a driver that doesn't recognize it but would still work
without the additional details, it's better to treat it as a reserved
region than try to map it.

>> +These faults are not recoverable\footnote{This means that the PRI
>> +extension to PCI, for example, that allows recoverable faults, isn't
>> +supported for the moment.}. The guest has to do its best to
>> +prevent any future fault from happening, by stopping or resetting the
>> +endpoint.
> 
> what does recoverable mean from virtio iommu pov?
> nothing right?
> it's an endpoint issue.

With PRI the IOMMU receives page requests, and handles them through the
memory management subsystem. A response to the fault ("we fixed the
fault, retry access") is returned through the IOMMU, not the endpoint.

> 
>> +
>> +When the fault is reported by a physical IOMMU, the fault reasons may not
>> +match exactly the reason of the original fault report. The device should
>> +try its best to find the closest match.
>> +
>> +If the device encounters a fault
> 
> device doesn't right? endpoint does ...

This paragraph describes what to do if the virtio-iommu encounters
some asynchronous error of its own, without any endpoint involved.

> 
>> that wasn't caused by a specific
>> +endpoint, it is unlikely that the driver would be able to do anything else
>> +than print the fault and stop using the device, so reporting the fault on
>> +the event queue isn't useful. In that case, we recommend using the
>> +DEVICE_NEEDS_RESET status bit.

Thanks,
Jean

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
  2019-05-10 12:07   ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
@ 2019-05-10 16:52     ` Michael S. Tsirkin
  2019-05-14 16:07       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2019-05-10 16:52 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	kevin.tian, lorenzo.pieralisi, bharat.bhushan, bauerman

On Fri, May 10, 2019 at 01:07:33PM +0100, Jean-Philippe Brucker wrote:
> Thanks for taking the time to review it! For brevity I only replied to
> questions and the bits I don't completely agree with, I will change the
> rest.
> 
> On 03/05/2019 18:05, Michael S. Tsirkin wrote:
> > On Tue, Apr 30, 2019 at 02:56:48PM +0100, Jean-Philippe Brucker wrote:
> >> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> [...]
> >> +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
> >> +  The number of domains supported is described in \field{domain_bits}
> > 
> > I would rename this to domain_range or something like this.
> > E.g. there's an option of 0 which is special.
> 
> I don't see a use for it right now, but I can make this a range.

The point is it's not always # of bits. It encodes the legal range
in a special way.

> >> +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}
> >> +
> >> +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
> >> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
> >> +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.
> > 
> > Why does it have to accept MAP_UNMAP if it does not want to use it?
> 
> I was thinking a future version that adds another mode would remove this
> statement. I can remove it, it's clear enough that currently the driver
> won't get anything if it doesn't accept the bit.
> 
> >> +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
> >> +access guest-physical addresses ("bypass mode").
> > 
> > I think this means actually "all unattached endpoints"?
> > 
> > 
> > So how does this interact with e.g. encrypted memory?
> > I think it's actually weaker. This bit IMHO just means that
> > all accesses are allowed by the iommu and all addresses
> > are translated 1:1.
> 
> Yes that's what it means, I'll try to clarify the description. I haven't
> looked at memory encryption yet, I guess it's disabled until you attach
> a domain, and then you need to pass keys in an extended MAP request.
> 
> > Also what about other endpoints? With bypass
> > does an endpoint not in any domain get access to all other
> > endpoints?
> 
> Note that it's guest-physical addresses, not physical memory. If other
> devices are memory-mapped, then with bypass an endpoint can access them.
> >> If the feature is not
> >> +negotiated, then any memory access from endpoints will fault.
> > 
> > fault -> fail?
> > 
> > So all accesses are disallowed, right?
> 
> Yes (Minus reserved regions in some cases. More below)
> 
> I rewrote the paragraph as:
> "If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from
> unattached endpoints are allowed and translated by the IOMMU using the
> identity function. If the feature is not negotiated, any memory access
> from an unattached endpoint will fail. Upon attaching an endpoint in
> bypass mode to a new domain, any memory access from the endpoint fails,
> since the domain does not contain any mapping."
> 
> >> +
> >> +\begin{itemize}
> >> +\item \field{page_size_mask} contains the bitmask of all page sizes that
> >> +  can be mapped. The least significant bit set defines the page
> >> +  granularity of IOMMU mappings.
> > 
> > so this refers to map/unmap requests only really?
> 
> Yes
> 
> >> Other bits in the mask are hints
> >> +  describing page sizes that the IOMMU can merge into a single mapping
> >> +  (page blocks).
> > 
> > what does this merging refer to? I don't find it anywhere in the spec.
> 
> It doesn't refer to anything else in the spec, the current wording isn't
> great. When the host stores mappings in page tables, it's useful for the
> guest to know which huge page sizes are available. For example if the
> physical IOMMU can map 2M and 1G blocks in addition to 4k pages, then
> the mask is 0x40201000. The hints allow a guest to optimize DMA
> allocation (iommu_dma_alloc() in Linux), in order to minimize the number
> of IOTLB entries used in hardware.
> 
> I now have the following:
> "Other bits in \field{page_size_mask} are hints and describe larger page
> sizes that the IOMMU device handles efficiently. For example, when the
> device stores mappings using a page table tree, it may be able to
> describe large mappings using a few leaf entries in intermediate tables,
> rather than using lots of entries in the last level of the tree.
> Creating mappings aligned on large page size can improve performance
> since they require fewer page tables and TLB entries."
> 
> >> +  The smallest page granularity supported by the IOMMU is one byte. It is
> >> +  legal for the driver to map one byte at a time if bit 0 of
> >> +  \field{page_size_mask} is set.
> > 
> > With single byte mappings, can't guest quickly exhaust
> > device memory? what guidance can we provide for handling
> > this kind of issue? how do device and driver avoid it?
> 
> The advantage of a single-byte granularity is arbitrary mapping sizes,
> it doesn't mean that the driver will use any single-byte mapping. If the
> host uses page tables to store mappings then the granularity is
> generally 4kB. But if it uses interval trees then it can have
> arbitrary mapping sizes, for example the guest could map a packet with a
> 1500-byte mapping.
> 
> For the moment the best way to deal with memory exhaustion might be to
> return a "out of memory" status in the MAP request (more below).
> 
> >> +\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
> >> +
> >> +\begin{lstlisting}
> >> +struct virtio_iommu_req_attach {
> >> +  struct virtio_iommu_req_head head;
> >> +  le32 domain;
> >> +  le32 endpoint;
> >> +  u8   reserved[8];
> >> +  struct virtio_iommu_req_tail tail;
> >> +};
> >> +\end{lstlisting}
> >> +
> >> +Attach an endpoint to a domain. \field{domain} is an identifier unique to
> >> +the virtio-iommu device.
> > 
> > what does this mean? do you mean that it uniquely identifies
> > a domain? 
> 
> It uniquely identifies a domain for one device. When there are multiple
> virtio-iommu devices, the driver can maintain one domain ID space per
> device rather than a global ID space.

I think all spec is per-device anyway. This qualification just
seems to confuse.


> > And I guess the assumption is that endpoints in different domains
> > are isolated from each other?
> 
> Yes, I'll add this.
> 
> >> +\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}
> >> +
> >> +If the \field{reserved} field of an ATTACH request is not zero, the device
> >> +SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
> >> +NOT attach the endpoint to the domain. \footnote{The device should
> > 
> > what does should mean here? Is expected to? Or is this part of
> > the conformance clause? the SHOULD and take it out of the footnote.
> > 
> >> +validate input of ATTACH requests in case the driver attempts to attach in
> >> +a mode that is unimplemented by the device, and would be incompatible with
> >> +the modes implemented by the device.}
> > 
> > Where does mode come from?
> 
> It would be a new feature bit. I don't think the footnote is useful, so
> I'll just specify that the device rejects the request if reserved != 0.
> 
> >> +\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}
> >> +
> >> +\begin{lstlisting}
> >> +struct virtio_iommu_req_map {
> >> +  struct virtio_iommu_req_head head;
> >> +  le32  domain;
> >> +  le64  virt_start;
> >> +  le64  virt_end;
> >> +  le64  phys_start;
> >> +  le32  flags;
> >> +  struct virtio_iommu_req_tail tail;
> >> +};
> >> +
> >> +/* Flags are: */
> >> +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
> >> +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
> >> +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
> > 
> > what is exec? pls add some comments near all flags.
> 
> Exec means instruction fetch. Some systems have different transaction
> flags for read and exec (PCI only supports that in the PASID TLP prefix)
> and some page tables have a "no exec" bit.

So let's say I don't set EXEC on such a system. Is EXEC allowed or not?

Maybe we can ask user to always set EXEC is exec is needed,
but then say we can not promise instruction fetch will fail
if exec is not set.


> >> The precise meaning of the MMIO flag depends on the
> >> +underlying memory architecture (for example on Armv8-A it corresponds to
> >> +the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
> >> +isn't required to support the MMIO flag.
> > 
> > I really dislike how this part is written. A platform specific bit that
> > does whatever platform wants? How does one use it based on this?
> > Platform documentation is hoghly unlikely to explain what does
> > VIRTIO_IOMMU_MAP_F_MMIO mean.
> > 
> > Does this mean "not buffered, elided, or combined with other writes"?
> > 
> > Also maybe add a feature bit so platforms can declare support.
> 
> Right, a feature bit for MMIO is a good idea. I'll add one for EXEC as
> well. I'll describe MMIO using only the expected behavior of memory
> accesses to the region, which is the same regardless of the architecture.
> 
> >> +The device MUST NOT allow writes to a range mapped without the
> >> +VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
> >> +does not support write-only mappings, the device MAY allow reads to a
> >> +range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
> >> +VIRTIO_IOMMU_MAP_F_READ.
> > 
> > do drivers care? if yes how about a way for it to know?
> 
> I don't think they care. If write-implies-read was a security problem I
> think that architectures (x86, arm64) would have changed it by now.

so maybe drivers MUST set READ if read is required?
and device does not have to respect it if not set?
and maybe a feature bit like exec?


> >> +(4) a = map(0, 9);
> >> +    unmap(0, 4)                  -> faults, doesn't unmap anything
> > 
> > fails?
> > 
> >> +
> >> +(5) a = map(0, 4);
> >> +    b = map(5, 9);
> >> +    unmap(0, 4)                  -> succeeds, unmaps a
> >> +
> >> +(6) a = map(0, 4);
> >> +    unmap(0, 9)                  -> succeeds, unmaps a
> >> +
> >> +(7) a = map(0, 4);
> >> +    b = map(10, 14);
> >> +    unmap(0, 14)                 -> succeeds, unmaps a and b
> >> +\end{lstlisting}
> >> +
> > 
> > what about a partial unmap? e.g. unmap 0, 3?
> 
> That's not allowed at the moment (example 4 and next-to-last paragraph
> of device normative).

worth saying then.

I think they are very useful but can depend on a feature bit.


> >> +If a mapping affected by the range is not covered in its entirety by the
> >> +range (the UNMAP request would split the mapping), then the device SHOULD
> >> +set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
> >> +remove any mapping.
> >> +
> >> +If part of the range or the full range is not covered by an existing
> >> +mapping, then the device SHOULD remove all mappings affected by the range
> >> +and set the request \field{status} to VIRTIO_IOMMU_S_OK.
> >> +
> > 
> > 
> > So based on above, it seems clear that device must remember any number
> > of mappings sent to it in device memory. Seems like an easy way
> > for guest to use any amount of device memory.
> > 
> > If any limits are allowed then I think device needs a way
> > to communicate them to driver.
> 
> Except for embedded implementations, it seems difficult to determine a
> limit upfront, since it's going to depend on the host's memory which
> changes at runtime. And even if we do advertise a limit in the config,
> the driver is unlikely to change its DMA allocation strategy. It will
> still create mappings until it has to return an error to the caller. So
> for now the best strategy is probably for the device to return a DEVERR
> status when it runs out of mappings.

Let's at least create a dedicated code for this?

> >> +The driver parses the first property by reading \field{type}, then
> >> +\field{length}. If the driver recognizes \field{type}, it reads and
> >> +handles the rest of the property. The driver then reads the next property,
> >> +that is located $(\field{length} + 4)$ bytes after the beginning of the
> >> +first one, and so on. The driver parses all properties until it reaches a
> >> +NONE property or the end of \field{properties}.
> > 
> > I guess device must make sure length fits within the buffer?
> 
> Yes, the device knows how much info it needs to communicate, and writes
> it in the probe_size field of the config.

i would add a conformance statement that it must

> >> +\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}
> >> +
> >> +Marks the end of the property list. This property doesn't have any value,
> >> +and should have \field{length} 0.
> > 
> > why is this needed btw?
> 
> Just provides a convenient way to mark the end of the list.

but we know the length already. generally redundant info is
a source of bugs ...

> >> +\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM}
> >> +
> >> +The RESV_MEM property describes a chunk of reserved virtual memory. It may
> >> +be used by the device to describe virtual address ranges that shouldn't be
> >> +allocated by the driver, or that are special.
> > 
> > avoid shouldn't outside conformance clauses. maybe add a conformance
> > clause that driver must avoid these ranges?
> > what if driver ignores this property? I guess device must validate
> > the addresses?
> 
> Yes, device returns an error when a MAP request crosses a resv region.

ok let's add a conformance clause

> > 
> >> +
> >> +\begin{lstlisting}
> >> +struct virtio_iommu_probe_resv_mem {
> >> +  struct virtio_iommu_probe_property head;
> >> +  u8    subtype;
> >> +  u8    reserved[3];
> >> +  le64  start;
> >> +  le64  end;
> >> +};
> >> +\end{lstlisting}
> >> +
> >> +Fields \field{start} and \field{end} describe the range of reserved virtual
> >> +addresses. \field{subtype} may be one of:
> >> +
> >> +\begin{description}
> >> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> >> +    Accesses to virtual addresses in this region have undefined behavior.
> > 
> > undefined?  how can platforms with such a property coexist with untrusted
> > devices?
> 
> It is "undefined" because I can't enumerate all cases (don't know about
> all of them). This would contain for example the RMRR regions of VT-d
> (the BIOS reserving some DMA regions for itself), regions allocated for
> host-managed MSIs on arm platforms, and some PCI bridge windows
> (although they removed those from the Linux resv regions recently, not
> certain why). Ideally we wouldn't have any, but some special cases make
> it necessary.
> 
> In general having reserved regions is incompatible with untrusted
> devices, and in some cases incompatible with untrusted guests. But
> choosing the policy about which devices to present to guest is up to the
> platform. The host knows what the resv regions are used for, and if they
> are safe enough. The guest just need to knows what to avoid.

Yes but ... if you trust driver and device then what's the
point of the iommu?

RMRR access basically fails with translation right?
Not undefined.

What happens with MSIs on ARM?



> > 
> >> +    They may be aborted by the device,
> > 
> > the device as in the iommu device?
> 
> Yes, although I can't think of an example of this case right now.
> 
> > 
> >> bypass it,
> > 
> > bypass might be a big security problem.
> 
> I have two cases in mind:
> * MSI, which is fine since the irqchip provides device isolation,
> * mappings reserved by the BIOS, which again only affects the device but
> may allow the guest to access firmware information. In my opinion such
> endpoint shouldn't be assigned to a guest.

ok so drop this from spec?

> >> or never even reach it.
> > 
> > that's ok
> 
> We could be losing isolation if some bridge is intercepting accesses to
> a range of addresses.

hmm sorry not sure I understand.

> >> +    The region may also be used for host mappings, for example Message
> >> +    Signaled Interrupts.
> > 
> > Let's avoid mentioning host if we can.
> > 
> >> +
> >> +    The guest should neither use these virtual addresses in a MAP request
> >> +    nor instruct endpoints to perform DMA on them.
> > 
> > avoid should outside conformance clauses.
> > 
> > and what it guest does not obey this request?
> > 
> > I think we must fix this and require that such accesses
> > are ignored.
> 
> That's generally not something the host can enforce. It has to make sure
> that such accesses would only disturb the guest or the endpoint.

right. so let's require this.

> Otherwise it had better trust both guest and endpoint, or it shouldn't
> present the endpoint to the guest at all.
> 
> >> +The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
> >> +was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.
> > 
> > why is that a good idea?
> 
> Some future version might add a new reserved type that provides more
> details. For a driver that doesn't recognize it but would still work
> without the additional details, it's better to treat it as a reserved
> region than try to map it.

confused. didn't we say we never map these addresses?
if we just ignore a type then we will map the virtual address no?


> >> +These faults are not recoverable\footnote{This means that the PRI
> >> +extension to PCI, for example, that allows recoverable faults, isn't
> >> +supported for the moment.}. The guest has to do its best to
> >> +prevent any future fault from happening, by stopping or resetting the
> >> +endpoint.
> > 
> > what does recoverable mean from virtio iommu pov?
> > nothing right?
> > it's an endpoint issue.
> 
> With PRI the IOMMU receives page requests, and handles them through the
> memory management subsystem. A response to the fault ("we fixed the
> fault, retry access") is returned through the IOMMU, not the endpoint.

I'd say this is kind of true vacously then. IOW we don't
have a spec for PRI support but there is no need
to actually say this.

> > 
> >> +
> >> +When the fault is reported by a physical IOMMU, the fault reasons may not
> >> +match exactly the reason of the original fault report. The device should
> >> +try its best to find the closest match.
> >> +
> >> +If the device encounters a fault
> > 
> > device doesn't right? endpoint does ...
> 
> This paragraph describes what to do if the virtio-iommu encounters
> some asynchronous error of its own, without any endpoint involved.


ok ... so I'd remove should here as this is not in any
conformance clauses. and maybe stress "an internal error" not
"a fault".

> > 
> >> that wasn't caused by a specific
> >> +endpoint, it is unlikely that the driver would be able to do anything else
> >> +than print the fault and stop using the device, so reporting the fault on
> >> +the event queue isn't useful. In that case, we recommend using the
> >> +DEVICE_NEEDS_RESET status bit.
> 
> Thanks,
> Jean

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
  2019-05-10 16:52     ` Michael S. Tsirkin
@ 2019-05-14 16:07       ` Jean-Philippe Brucker
  2019-05-15  5:08         ` Tian, Kevin
       [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D19CA3166E@SHSMSX104.ccr.corp.intel.com>
  0 siblings, 2 replies; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-05-14 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	kevin.tian, lorenzo.pieralisi, bharat.bhushan, bauerman

On 10/05/2019 17:52, Michael S. Tsirkin wrote:
>>>> +/* Flags are: */
>>>> +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
>>>> +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
>>>> +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
>>>
>>> what is exec? pls add some comments near all flags.
>>
>> Exec means instruction fetch. Some systems have different transaction
>> flags for read and exec (PCI only supports that in the PASID TLP prefix)
>> and some page tables have a "no exec" bit.
> 
> So let's say I don't set EXEC on such a system. Is EXEC allowed or not?
> 
> Maybe we can ask user to always set EXEC is exec is needed,
> but then say we can not promise instruction fetch will fail
> if exec is not set.

So the semantics could be:
* device sets EXEC feature bit
  - driver sets EXEC flag in MAP, instruction fetch succeeds
  - driver doesn't set EXEC flag in MAP, instructions fetch will fail
* device doesn't set EXEC feature bit, then driver cannot set EXEC flag
but instruction fetch may not fail.

But I'm tempted to just remove the EXEC flag for now, I don't think
anyone needs it at the moment (and VFIO doesn't support it). I'd add it
back as NO_EXEC which is closer to what IOMMUs offer.


>>>> +\begin{description}
>>>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
>>>> +    Accesses to virtual addresses in this region have undefined behavior.
>>>
>>> undefined?  how can platforms with such a property coexist with untrusted
>>> devices?
>>
>> It is "undefined" because I can't enumerate all cases (don't know about
>> all of them). This would contain for example the RMRR regions of VT-d
>> (the BIOS reserving some DMA regions for itself), regions allocated for
>> host-managed MSIs on arm platforms, and some PCI bridge windows
>> (although they removed those from the Linux resv regions recently, not
>> certain why). Ideally we wouldn't have any, but some special cases make
>> it necessary.
>>
>> In general having reserved regions is incompatible with untrusted
>> devices, and in some cases incompatible with untrusted guests. But
>> choosing the policy about which devices to present to guest is up to the
>> platform. The host knows what the resv regions are used for, and if they
>> are safe enough. The guest just need to knows what to avoid.
> 
> Yes but ... if you trust driver and device then what's the
> point of the iommu?

Improving memory allocation. The guest can do scatter-gather for large
buffers, instead of allocating big contiguous chunks of guest-physical
addresses. And with device pass-through, any memory used for DMA has to
be pinned (since devices generally don't support page faults). So when
assigning an endpoint to a guest, without a vIOMMU all of the guest
memory has to be pinned upfront. With an IOMMU only the memory that will
actually be used for DMA is pinned.

> RMRR access basically fails with translation right?
> Not undefined.

From section 3.15 of VT-d rev3.0, it looks like they are translated

"Requests to these reserved regions may either occur as a result of
operations performed by the system software driver (for example in the
case of DMA from unified memory access (UMA) graphics controllers to
graphics reserved memory), or may be initiated by non system software
(for example in case of DMA performed by a USB controller under BIOS SMM
control for legacy keyboard emulation)."

> What happens with MSIs on ARM?

On Arm platforms it's the IRQ chip rather than the IOMMU that performs
MSI isolation - ensure that an endpoint doesn't trigger MSIs for another
endpoint. The SMMU doesn't differentiate an MSI from a normal write,
unlike x86 which has a special address range. MSI-X tables for a
pass-through device are managed by the host, and a virtual MSI-X table
is presented to the guest. So the host allocates a VA range and use it
to map the MSI doorbell in the SMMU. It then reports that VA range as
reserved to the guest. If the endpoint did write to the region, nothing
bad would happen, it would simply trigger an MSI as if it had triggered
via the MSI-X table. But the guest can't use that region for normal
memory mapping.

>>>> or never even reach it.
>>>
>>> that's ok
>>
>> We could be losing isolation if some bridge is intercepting accesses to
>> a range of addresses.
> 
> hmm sorry not sure I understand.

After reading a bit more about this, I don't think this is relevant here.

This was about lack of ACS isolation allowing endpoints to do p2p by
writing to some specific MMIO region, but we already deal with this in
the ATTACH section - either endpoints are properly isolated with ACS, or
the platform describes them as being in the same IOMMU group and they
cannot be isolated from each other. We don't need to care about reserved
MMIO ranges on top of that.

RESV_MEM is used for MSI and RMRR-style regions for now, so I'll
reformulate this paragraph. I'll also require that accesses to RESV
regions don't affect anything else than the endpoint and the SW that
owns it.


>>>> +The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
>>>> +was VIRTIO_IOMMU_RESV_MEM_T_RESERVED.
>>>
>>> why is that a good idea?
>>
>> Some future version might add a new reserved type that provides more
>> details. For a driver that doesn't recognize it but would still work
>> without the additional details, it's better to treat it as a reserved
>> region than try to map it.
> 
> confused. didn't we say we never map these addresses?
> if we just ignore a type then we will map the virtual address no?

Yes, that's why we require the driver not to ignore the region

Thanks,
Jean

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
  2019-05-14 16:07       ` Jean-Philippe Brucker
@ 2019-05-15  5:08         ` Tian, Kevin
       [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D19CA3166E@SHSMSX104.ccr.corp.intel.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Tian, Kevin @ 2019-05-15  5:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	lorenzo.pieralisi, bharat.bhushan, bauerman

> From: Jean-Philippe Brucker 
> Sent: Wednesday, May 15, 2019 12:08 AM
> 
> On 10/05/2019 17:52, Michael S. Tsirkin wrote:
> >>>> +/* Flags are: */
> >>>> +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
> >>>> +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
> >>>> +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
> >>>
> >>> what is exec? pls add some comments near all flags.
> >>
> >> Exec means instruction fetch. Some systems have different transaction
> >> flags for read and exec (PCI only supports that in the PASID TLP prefix)
> >> and some page tables have a "no exec" bit.
> >
> > So let's say I don't set EXEC on such a system. Is EXEC allowed or not?
> >
> > Maybe we can ask user to always set EXEC is exec is needed,
> > but then say we can not promise instruction fetch will fail
> > if exec is not set.
> 
> So the semantics could be:
> * device sets EXEC feature bit
>   - driver sets EXEC flag in MAP, instruction fetch succeeds
>   - driver doesn't set EXEC flag in MAP, instructions fetch will fail
> * device doesn't set EXEC feature bit, then driver cannot set EXEC flag
> but instruction fetch may not fail.
> 
> But I'm tempted to just remove the EXEC flag for now, I don't think
> anyone needs it at the moment (and VFIO doesn't support it). I'd add it
> back as NO_EXEC which is closer to what IOMMUs offer.

Agree. It's more for PASID-based usage e.g. sharing CPU page table
with device in SVA. Now this version is purely about MAP/UNMAP.
On Intel VT-d, EXEC is ignored for request-without-PASID.

> 
> 
> >>>> +\begin{description}
> >>>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> >>>> +    Accesses to virtual addresses in this region have undefined behavior.
> >>>
> >>> undefined?  how can platforms with such a property coexist with
> untrusted
> >>> devices?
> >>
> >> It is "undefined" because I can't enumerate all cases (don't know about
> >> all of them). This would contain for example the RMRR regions of VT-d
> >> (the BIOS reserving some DMA regions for itself), regions allocated for
> >> host-managed MSIs on arm platforms, and some PCI bridge windows
> >> (although they removed those from the Linux resv regions recently, not
> >> certain why). Ideally we wouldn't have any, but some special cases make
> >> it necessary.
> >>
> >> In general having reserved regions is incompatible with untrusted
> >> devices, and in some cases incompatible with untrusted guests. But
> >> choosing the policy about which devices to present to guest is up to the
> >> platform. The host knows what the resv regions are used for, and if they
> >> are safe enough. The guest just need to knows what to avoid.
> >
> > Yes but ... if you trust driver and device then what's the
> > point of the iommu?
> 
> Improving memory allocation. The guest can do scatter-gather for large
> buffers, instead of allocating big contiguous chunks of guest-physical
> addresses. And with device pass-through, any memory used for DMA has to
> be pinned (since devices generally don't support page faults). So when
> assigning an endpoint to a guest, without a vIOMMU all of the guest
> memory has to be pinned upfront. With an IOMMU only the memory that will
> actually be used for DMA is pinned.

I'd not vote using current vIOMMU just for avoiding pin instead of for
isolation purpose... the map/unmap overhead is high except you only 
talking about static mapping e.g. in guest user space driver (e.g. DPDK) 
(but then it's actually for isolation purpose between user and kernel).
 btw we are working on a lighter-weight approach (aim <5% perf drop
for most cases) while allowing host to fully introspect guest DMA activities. 
Likely we'll introduce a new virtio-iommu capability for this purpose, 
instead of using current MAP/UNMAP primitives. Stay tuned and we'll 
send out RFC soon once getting good result. :-)

And I don't think having reserved regions is incompatible with untrusted
devices. Favoring reserved region is a functional requirement so guest
driver can operate assigned device in a good shape. Violating reserved
region requirement (e.g. map guest memory page to reserved region) just
means device access to this page doesn't provide desired behavior, since
host will always guarantee safe behavior in physical IOMMU on those
regions:

- it is either mapped to fixed resource (e.g ARM MSI doorbell)
which is associated with this device and thus safe
- or map to nothing

I don't think architecturally we break any isolation implication on
IOMMU here.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
       [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D19CA3166E@SHSMSX104.ccr.corp.intel.com>
@ 2019-05-15  5:30           ` Tian, Kevin
  2019-05-16 11:46             ` [virtio-comment] " Jean-Philippe Brucker
  0 siblings, 1 reply; 11+ messages in thread
From: Tian, Kevin @ 2019-05-15  5:30 UTC (permalink / raw)
  To: 'Jean-Philippe Brucker', Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	lorenzo.pieralisi, bharat.bhushan, bauerman,
	Alex Williamson (alex.williamson@redhat.com)

> From: Tian, Kevin
> Sent: Wednesday, May 15, 2019 1:08 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>; Michael S.
> Tsirkin <mst@redhat.com>
> >
> >
> > >>>> +\begin{description}
> > >>>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> > >>>> +    Accesses to virtual addresses in this region have undefined
> behavior.
> > >>>
> > >>> undefined?  how can platforms with such a property coexist with
> > untrusted
> > >>> devices?
> > >>
> > >> It is "undefined" because I can't enumerate all cases (don't know about
> > >> all of them). This would contain for example the RMRR regions of VT-d
> > >> (the BIOS reserving some DMA regions for itself), regions allocated for
> > >> host-managed MSIs on arm platforms, and some PCI bridge windows
> > >> (although they removed those from the Linux resv regions recently, not
> > >> certain why). Ideally we wouldn't have any, but some special cases make
> > >> it necessary.

+Alex.

Actually I don't think RMRR is right example here (sorry if it's my input in
our previous discussion). There was concern on RMRR providing a shared
memory region between device and platform controller, thus may cause
isolation issue:
	https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf

So the basic policy for devices with RMRR is, excluding them from device
assignment, except for Intel GPU and USB. The latter two has well-understood
behavior on RMRR, thus they can be safely assigned with RMRR ignored
(i.e. RMRR region can be reused as valid IOVA region).  

As I commented below, I think it's safe to describe reserved region behavior
as either succeed with undefined behavior (e.g. in ARM MSI doorbell case, 
where a spurious interrupt on allocated doorbell for this device is triggered 
instead of desired DMA access), or cause vIOMMU unrecoverable fault due 
to invalid mappings. An untrusted device can never use reserved region to 
escape since physically IOMMU only contains verified mapping to granted
resource (MSI) or nothing.

> > >>
> > >> In general having reserved regions is incompatible with untrusted
> > >> devices, and in some cases incompatible with untrusted guests. But
> > >> choosing the policy about which devices to present to guest is up to the
> > >> platform. The host knows what the resv regions are used for, and if they
> > >> are safe enough. The guest just need to knows what to avoid.
> > >
> > > Yes but ... if you trust driver and device then what's the
> > > point of the iommu?
> >
> > Improving memory allocation. The guest can do scatter-gather for large
> > buffers, instead of allocating big contiguous chunks of guest-physical
> > addresses. And with device pass-through, any memory used for DMA has to
> > be pinned (since devices generally don't support page faults). So when
> > assigning an endpoint to a guest, without a vIOMMU all of the guest
> > memory has to be pinned upfront. With an IOMMU only the memory that
> will
> > actually be used for DMA is pinned.
> 
> I'd not vote using current vIOMMU just for avoiding pin instead of for
> isolation purpose... the map/unmap overhead is high except you only
> talking about static mapping e.g. in guest user space driver (e.g. DPDK)
> (but then it's actually for isolation purpose between user and kernel).
>  btw we are working on a lighter-weight approach (aim <5% perf drop
> for most cases) while allowing host to fully introspect guest DMA activities.
> Likely we'll introduce a new virtio-iommu capability for this purpose,
> instead of using current MAP/UNMAP primitives. Stay tuned and we'll
> send out RFC soon once getting good result. :-)
> 
> And I don't think having reserved regions is incompatible with untrusted
> devices. Favoring reserved region is a functional requirement so guest
> driver can operate assigned device in a good shape. Violating reserved
> region requirement (e.g. map guest memory page to reserved region) just
> means device access to this page doesn't provide desired behavior, since
> host will always guarantee safe behavior in physical IOMMU on those
> regions:
> 
> - it is either mapped to fixed resource (e.g ARM MSI doorbell)
> which is associated with this device and thus safe
> - or map to nothing
> 
> I don't think architecturally we break any isolation implication on
> IOMMU here.
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [virtio-comment] Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
  2019-05-15  5:30           ` Tian, Kevin
@ 2019-05-16 11:46             ` Jean-Philippe Brucker
  0 siblings, 0 replies; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-05-16 11:46 UTC (permalink / raw)
  To: Tian, Kevin, Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, joro, tnowicki, eric.auger,
	lorenzo.pieralisi, bharat.bhushan, bauerman,
	Alex Williamson (alex.williamson@redhat.com)

On 15/05/2019 06:30, Tian, Kevin wrote:
>> From: Tian, Kevin
>> Sent: Wednesday, May 15, 2019 1:08 PM
>> To: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>; Michael S.
>> Tsirkin <mst@redhat.com>
>>>
>>>
>>>>>>> +\begin{description}
>>>>>>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
>>>>>>> +    Accesses to virtual addresses in this region have undefined
>> behavior.
>>>>>>
>>>>>> undefined?  how can platforms with such a property coexist with
>>> untrusted
>>>>>> devices?
>>>>>
>>>>> It is "undefined" because I can't enumerate all cases (don't know about
>>>>> all of them). This would contain for example the RMRR regions of VT-d
>>>>> (the BIOS reserving some DMA regions for itself), regions allocated for
>>>>> host-managed MSIs on arm platforms, and some PCI bridge windows
>>>>> (although they removed those from the Linux resv regions recently, not
>>>>> certain why). Ideally we wouldn't have any, but some special cases make
>>>>> it necessary.
> 
> +Alex.
> 
> Actually I don't think RMRR is right example here (sorry if it's my input in
> our previous discussion). There was concern on RMRR providing a shared
> memory region between device and platform controller, thus may cause
> isolation issue:
> 	https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
> 
> So the basic policy for devices with RMRR is, excluding them from device
> assignment, except for Intel GPU and USB. The latter two has well-understood
> behavior on RMRR, thus they can be safely assigned with RMRR ignored
> (i.e. RMRR region can be reused as valid IOVA region).  

Ah I see, seems like a good policy.

> As I commented below, I think it's safe to describe reserved region behavior
> as either succeed with undefined behavior (e.g. in ARM MSI doorbell case, 
> where a spurious interrupt on allocated doorbell for this device is triggered 
> instead of desired DMA access), or cause vIOMMU unrecoverable fault due 
> to invalid mappings. An untrusted device can never use reserved region to 
> escape since physically IOMMU only contains verified mapping to granted
> resource (MSI) or nothing.

Yes, I will add that as a requirement in next version: reserved regions
are okay as long as they don't allow a device to escape isolation.
Finding the right wording without being too vague is a bit tough for
this one.

>>>>>
>>>>> In general having reserved regions is incompatible with untrusted
>>>>> devices, and in some cases incompatible with untrusted guests. But
>>>>> choosing the policy about which devices to present to guest is up to the
>>>>> platform. The host knows what the resv regions are used for, and if they
>>>>> are safe enough. The guest just need to knows what to avoid.
>>>>
>>>> Yes but ... if you trust driver and device then what's the
>>>> point of the iommu?
>>>
>>> Improving memory allocation. The guest can do scatter-gather for large
>>> buffers, instead of allocating big contiguous chunks of guest-physical
>>> addresses. And with device pass-through, any memory used for DMA has to
>>> be pinned (since devices generally don't support page faults). So when
>>> assigning an endpoint to a guest, without a vIOMMU all of the guest
>>> memory has to be pinned upfront. With an IOMMU only the memory that
>> will
>>> actually be used for DMA is pinned.
>>
>> I'd not vote using current vIOMMU just for avoiding pin instead of for
>> isolation purpose... the map/unmap overhead is high except you only
>> talking about static mapping e.g. in guest user space driver (e.g. DPDK)
>> (but then it's actually for isolation purpose between user and kernel).
>>  btw we are working on a lighter-weight approach (aim <5% perf drop
>> for most cases) while allowing host to fully introspect guest DMA activities.
>> Likely we'll introduce a new virtio-iommu capability for this purpose,
>> instead of using current MAP/UNMAP primitives. Stay tuned and we'll
>> send out RFC soon once getting good result. :-)

Sounds great, I'd like to hear more about this. I also have some drafts
for performance improvement and SVA, but haven't found time to progress
on that in a while. For SVA however, introspection seems nearly
impossible (on Arm the host may not even see invalidation commands from
the guest).

Thanks,
Jean

>>
>> And I don't think having reserved regions is incompatible with untrusted
>> devices. Favoring reserved region is a functional requirement so guest
>> driver can operate assigned device in a good shape. Violating reserved
>> region requirement (e.g. map guest memory page to reserved region) just
>> means device access to this page doesn't provide desired behavior, since
>> host will always guarantee safe behavior in physical IOMMU on those
>> regions:
>>
>> - it is either mapped to fixed resource (e.g ARM MSI doorbell)
>> which is associated with this device and thus safe
>> - or map to nothing
>>
>> I don't think architecturally we break any isolation implication on
>> IOMMU here.
>>
>> Thanks
>> Kevin


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-05-16 11:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-30 13:56 [virtio-comment] [PATCH v3] Add virtio-iommu device specification Jean-Philippe Brucker
2019-04-30 14:42 ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
2019-05-03 19:47   ` Michael S. Tsirkin
2019-05-05 13:33     ` Jean-Philippe Brucker
2019-05-03 17:05 ` [virtio-comment] " Michael S. Tsirkin
2019-05-10 12:07   ` [virtio-comment] Re: [virtio-dev] " Jean-Philippe Brucker
2019-05-10 16:52     ` Michael S. Tsirkin
2019-05-14 16:07       ` Jean-Philippe Brucker
2019-05-15  5:08         ` Tian, Kevin
     [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D19CA3166E@SHSMSX104.ccr.corp.intel.com>
2019-05-15  5:30           ` Tian, Kevin
2019-05-16 11:46             ` [virtio-comment] " Jean-Philippe Brucker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.