All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] virtio-iommu version 0.4
@ 2017-08-04 18:19 ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

This is the continuation of my proposal for virtio-iommu, the para-
virtualized IOMMU. Here is a summary of the changes since last time [1]:

* The virtio-iommu document now resembles an actual specification. It is
  split into a formal description of the virtio device, and implementation
  notes. Please find sources and binaries at [2].

* Added a probe request to describe to the guest different properties that
  do not fit in firmware or in the virtio config space. This is a
  necessary stepping stone for extending the virtio-iommu.

* There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
  Bhushan.

You can find the Linux driver and kvmtool device at [4] and [5]. I
plan to rework driver and kvmtool device slightly before sending the
patches.

To understand the virtio-iommu, I advise to first read introduction and
motivation, then skim through implementation notes and finally look at the
device specification.

I wasn't sure how to organize the review. For those who prefer to comment
inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
to this thread. They are the biggest chunks of the document. But LaTeX
isn't very pleasant to read, so you can simply send a list of comments in
relation to section numbers and a few words of context, we'll manage.

---
Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
more easily differences since the RFC (see [6]), but haven't been made
public so far. This is the first public posting since initial proposal
[1], and the following describes all changes.

## v0.1 ##

Content is the same as the RFC, but formatted to LaTeX. 'make' generates
one PDF and one HTML document.

## v0.2 ##

Add introductions, improve topology example and firmware description based
on feedback and a number of useful discussions.

## v0.3 ##

Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
the device and driver behaviour. Unmap semantics are consolidated; they
are now closer to VFIO Type1 v2 semantics.

## v0.4 ##

Introduce PROBE requests. They provide per-endpoint information to the
driver that couldn't be described otherwise.

For the moment, they allow to handle MSIs on x86 virtual platforms (see
3.2). To do that we communicate reserved IOVA regions, that will also be
useful for describing regions that cannot be mapped for a given endpoint,
for instance addresses that correspond to a PCI bridge window.

Introducing such a large framework for this tiny feature may seem
overkill, but it is needed for future extensions of the virtio-iommu and I
believe it really is worth the effort.

## Future ##

Other extensions are in preparation. I won't detail them here because v0.4
already is a lot to digest, but in short, building on top of PROBE:

* First, since the IOMMU is paravirtualized, the device can expose some
  properties of the physical topology to the guest, and let it allocate
  resources more efficiently. For example, when the virtio-iommu manages
  both physical and emulated endpoints, with different underlying IOMMUs,
  we now have a way to describe multiple page and block granularities,
  instead of forcing the guest to use the most restricted one for all
  endpoints. This will most likely be in v0.5.

* Then on top of that, a major improvement will describe hardware
  acceleration features available to the guest. There is what I call "Page
  Table Handover" (or simply, from the host POV, "Nested"), the ability
  for the guest to manipulate its own page tables instead of sending
  MAP/UNMAP requests to the host. This, along with IO Page Fault
  reporting, will also permit SVM virtualization on different platforms.

Thanks,
Jean

[1] http://www.spinics.net/lists/kvm/msg147990.html
[2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
    http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
    I reiterate the disclaimers: don't use this document as a reference,
    it's a draft. It's also not an OASIS document yet. It may be riddled
    with mistakes. As this is a working draft, it is unstable and I do not
    guarantee backward compatibility of future versions.
[3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
[4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
    Warning: UAPI headers have changed! They didn't follow the spec,
    please update. (Use branch v0.1, that has the old headers, for the
    Qemu prototype [3])
[5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
    Warning: command-line has changed! Use --viommu vfio[,opts] and
    --viommu virtio[,opts] to instantiate a device.
[6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-08-04 18:19 ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

This is the continuation of my proposal for virtio-iommu, the para-
virtualized IOMMU. Here is a summary of the changes since last time [1]:

* The virtio-iommu document now resembles an actual specification. It is
  split into a formal description of the virtio device, and implementation
  notes. Please find sources and binaries at [2].

* Added a probe request to describe to the guest different properties that
  do not fit in firmware or in the virtio config space. This is a
  necessary stepping stone for extending the virtio-iommu.

* There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
  Bhushan.

You can find the Linux driver and kvmtool device at [4] and [5]. I
plan to rework driver and kvmtool device slightly before sending the
patches.

To understand the virtio-iommu, I advise to first read introduction and
motivation, then skim through implementation notes and finally look at the
device specification.

I wasn't sure how to organize the review. For those who prefer to comment
inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
to this thread. They are the biggest chunks of the document. But LaTeX
isn't very pleasant to read, so you can simply send a list of comments in
relation to section numbers and a few words of context, we'll manage.

---
Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
more easily differences since the RFC (see [6]), but haven't been made
public so far. This is the first public posting since initial proposal
[1], and the following describes all changes.

## v0.1 ##

Content is the same as the RFC, but formatted to LaTeX. 'make' generates
one PDF and one HTML document.

## v0.2 ##

Add introductions, improve topology example and firmware description based
on feedback and a number of useful discussions.

## v0.3 ##

Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
the device and driver behaviour. Unmap semantics are consolidated; they
are now closer to VFIO Type1 v2 semantics.

## v0.4 ##

Introduce PROBE requests. They provide per-endpoint information to the
driver that couldn't be described otherwise.

For the moment, they allow to handle MSIs on x86 virtual platforms (see
3.2). To do that we communicate reserved IOVA regions, that will also be
useful for describing regions that cannot be mapped for a given endpoint,
for instance addresses that correspond to a PCI bridge window.

Introducing such a large framework for this tiny feature may seem
overkill, but it is needed for future extensions of the virtio-iommu and I
believe it really is worth the effort.

## Future ##

Other extensions are in preparation. I won't detail them here because v0.4
already is a lot to digest, but in short, building on top of PROBE:

* First, since the IOMMU is paravirtualized, the device can expose some
  properties of the physical topology to the guest, and let it allocate
  resources more efficiently. For example, when the virtio-iommu manages
  both physical and emulated endpoints, with different underlying IOMMUs,
  we now have a way to describe multiple page and block granularities,
  instead of forcing the guest to use the most restricted one for all
  endpoints. This will most likely be in v0.5.

* Then on top of that, a major improvement will describe hardware
  acceleration features available to the guest. There is what I call "Page
  Table Handover" (or simply, from the host POV, "Nested"), the ability
  for the guest to manipulate its own page tables instead of sending
  MAP/UNMAP requests to the host. This, along with IO Page Fault
  reporting, will also permit SVM virtualization on different platforms.

Thanks,
Jean

[1] http://www.spinics.net/lists/kvm/msg147990.html
[2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
    http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
    I reiterate the disclaimers: don't use this document as a reference,
    it's a draft. It's also not an OASIS document yet. It may be riddled
    with mistakes. As this is a working draft, it is unstable and I do not
    guarantee backward compatibility of future versions.
[3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
[4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
    Warning: UAPI headers have changed! They didn't follow the spec,
    please update. (Use branch v0.1, that has the old headers, for the
    Qemu prototype [3])
[5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
    Warning: command-line has changed! Use --viommu vfio[,opts] and
    --viommu virtio[,opts] to instantiate a device.
[6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC] virtio-iommu v0.4 - IOMMU Device
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-04 18:19   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

The following is roughly the content of device-operations.tex

---
\section{IOMMU device}\label{sec:Device Types / IOMMU Device}

The virtio-iommu device manages Direct Memory Access (DMA) from one or
more endpoints. It may act as a proxy for multiple physical IOMMUs
managing devices assigned to the guest, and as standalone IOMMU for
virtual devices.

The driver first discovers endpoints managed by the virtio-iommu device
using standard firmware mechanisms. It then sends requests to create
virtual address spaces and virtual-to-physical mappings for these
endpoints. In its simplest form, the virtio-iommu supports four request
types:

\begin{enumerate}
\item Create an address space and attach an endpoint to it.  \\
  \texttt{attach(device = 0x104, address space = 1)}
\item Create a mapping between a range of guest-virtual and guest-physical
  address. \\
  \texttt{map(address space = 1, virt = 0x1000, phys = 0xa000,
          size = 0x1000, flags = READ)}

  Endpoint 0x104, for example a hardware PCI endpoint, can now read at
  addresses 0x1000-0x1fff. These accesses are translated into
  system-physical addresses by the IOMMU.

\item Remove the mapping.\\
  \texttt{unmap(address space = 1, virt = 0x1000, size = 0x1000)}

  Any access to addresses 0x1000-0x1fff by endpoint 0x104 would now be
  rejected.
\item Detach the device and remove the address space.\\
  \texttt{detach(device = 0x104)}
\end{enumerate}

\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}

TBD. During development, use 61216.

\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}

\begin{description}
\item[0] requestq
\end{description}

\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}

\begin{description}
\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
  Available range of virtual addresses is described in \field{input_range}

\item[VIRTIO_IOMMU_F_IOASID_BITS (1)]
  The number of address spaces supported is described in \field{ioasid_bits}

\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
  Map and unmap requests are available.\footnote{Future extensions may add
  different modes of operations. At the moment, only
  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}

\item[VIRTIO_IOMMU_F_BYPASS (3)]
  When not attached to an address space, endpoints downstream of the IOMMU
  can access the guest-physical address space.

\item[VIRTIO_IOMMU_F_PROBE (4)]
  Probe request is available.
\end{description}

\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}

The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.

% XXX F_MAP_UNMAP will be optional when introducing PTH. But a 0.2 driver
% must implement it (otherwise the device is useless)

\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}

If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS or VIRTIO_IOMMU_F_PROBE feature bits, and if
the driver did not accept this feature bit, then the device MAY signal
failure by failing to set FEATURES_OK \field{device status} bit when the
driver writes it.

If the device offers the VIRTIO_IOMMU_F_MAP_UNMAP feature bit, and if the
driver did not accept this feature bit, then the device SHOULD behave as
if the feature was negotiated.
% This takes into account all the following "If the F_MAP_UNMAP feature
% was negotiated..."

% If the driver supports F_PTH but not F_MAP_UNMAP, then the driver MUST
% give up upon seeing that the 0.2 device doesn't support F_PTH. If it
% supports F_MAP_UNMAP, then it SHOULD use F_MAP_UNMAP (this is described
% in "Driver Requirements: Feature Bits")

% If the driver supports F_PTH but the device doesn't, and the driver
% stupidly sets F_PTH but not F_MAP_UNMAP, then the device SHOULD reject
% any PTH request and PTH flag in attach.

% When not using the legacy interface, if the driver doesn't negotiate
% F_MAP_UNMAP, then the device may disable it an reject any request.

\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits / Legacy Interface: Feature bits}

When using the legacy interface, transitional devices MUST support guests
which do not negotiate any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP or
VIRTIO_IOMMU_F_PROBE features, and MUST behave as if the feature was
negotiated.

%\subsection{Feature Bits Requirements}\label{sec:Device Types / IOMMU Device / Feature bits requirements}

\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}

The \field{page_size_mask} field is always present. Availability of the
others depend on various feature bits as indicated above.

\begin{lstlisting}
struct virtio_iommu_config {
	u64 page_size_mask;
	struct virtio_iommu_range {
		u64 start;
		u64 end;
	} input_range;
	u8 ioasid_bits;
	u8 padding[3];
	u32 probe_size;
};
\end{lstlisting}

\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}

The driver MUST NOT write to device configuration fields.

\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}

The device SHOULD set \field{padding} to zero.

The device MUST set at least one bit in \field{page_size_mask}, describing
the page granularity. The device MAY set more than one bit in
\field{page_size_mask}.

\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout / Legacy Interface: Device configuration layout}
When using the legacy interface, transitional devices and drivers
MUST format the fields in struct virtio_iommu_config
according to the native endian of the guest rather than
(necessarily when not using the legacy interface) little-endian.


\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}

When the device is reset, endpoints are not attached to any address space.
If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
access guest-physical addresses ("bypass mode"). If the feature is not
negotiated, then any memory access from endpoints will fault. Upon
attaching an endpoint in bypass mode to a new address space, any memory
access from the endpoint will fault, since the address space does not
contain any mapping.

The driver chooses operating mode depending on its capabilities. In this
revision of the virtio-iommu specification, the only supported mode is
VIRTIO_IOMMU_F_MAP_UNMAP.

\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}

The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.

If the VIRTIO_IOMMU_F_PROBE feature is offered, the driver SHOULD send a
VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
endpoint to an address space.

\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}

If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
device SHOULD NOT let endpoints access the guest-physical address space.
However, the device MAY let endpoints access memory regions negotiated
with VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (see
\ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}).

\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}

Driver send requests on the request virtqueue, notifies the device and
waits for the device to return the request with a status in the used ring.
All requests are split in two parts: one device-readable, one device-
writable. Each request is therefore described with at least two
descriptors, as illustrated below.

\begin{figure}[htb]
  \centering
  \includegraphics[width=0.7\textwidth]{img/request-wrapping.png}
  \caption{Anatomy of a virtio-iommu request}
\end{figure}

\begin{lstlisting}
struct virtio_iommu_req_head {
	u8	type;
	u8	reserved[3];
};

struct virtio_iommu_req_tail {
	u8	status;
	u8	reserved[3];
};
\end{lstlisting}

\rfc{% TODO
There is a problem with framing using multiple chains... If a request
isn't recognized by the device and is scattered across multiple descriptor
chains, how would the device know where this request ends and where the
next one begins? It assumes that a transition from WO descriptors to RO
descriptors is a new request, ok. But if this unknown request is from a
future extension that uses interleaved RO, WO, RO, WO descriptors for a
single request, then the device is doomed.\\
The extension will have to force that third buffer to start with 0 so our
device, that thinks it's a new request, doesn't recognize the request type
and waits for the next one.\\
Alternatively, we could add a 16-bit 'size' field into
virtio_iommu_req_head. We'd loose some space for future flags, and force
requests to be at most 64k, or 256k if we drop size bits [1:0]. (head
couldn't be extended in the future, with a field to describe a bigger
size, because our legacy device here would ignore it.) We'd have to resort
to multiple "container" requests transporting a single big one.\\
Personally I think we can let future extensions deal with interleaved
RO/WO/RO descriptors, but do you think a 'size' field would be better?
}

Type may be one of:

\begin{lstlisting}
#define VIRTIO_IOMMU_T_ATTACH			1
#define VIRTIO_IOMMU_T_DETACH			2
#define VIRTIO_IOMMU_T_MAP			3
#define VIRTIO_IOMMU_T_UNMAP			4
#define VIRTIO_IOMMU_T_PROBE			5
\end{lstlisting}

A few general-purpose status codes are defined here. Unless explicitly
described in a \textbf{Requirements} section, these values are hints to
make troubleshooting easier.

When the device fails to parse a request, for instance if a request seems
too small for its type and the device cannot find the tail, then it will
be unable to set \field{status}. In that case, it should return the
buffers without writing in them.

\begin{lstlisting}
/* All good! Carry on. */
#define VIRTIO_IOMMU_S_OK			0
/* Virtio communication error */
#define VIRTIO_IOMMU_S_IOERR			1
/* Unsupported request */
#define VIRTIO_IOMMU_S_UNSUPP			2
/* Internal device error */
#define VIRTIO_IOMMU_S_DEVERR			3
/* Invalid parameters */
#define VIRTIO_IOMMU_S_INVAL			4
/* Out-of-range parameters */
#define VIRTIO_IOMMU_S_RANGE			5
/* Entry not found */
#define VIRTIO_IOMMU_S_NOENT			6
/* Bad address */
#define VIRTIO_IOMMU_S_FAULT			7
\end{lstlisting}

Range limits of some request fields are described in the device
configuration:

\begin{itemize}
\item \field{page_size_mask} contains the bitmask of all page sizes that
  can be mapped. The least significant bit set defines the page
  granularity of IOMMU mappings. Other bits in the mask are hints
  describing page sizes that the IOMMU can merge into a single mapping
  (page blocks).

  The smallest page granularity supported by the IOMMU is one byte. It is
  legal for the driver to map one byte at a time if bit 0 of
  \field{page_size_mask} is set.

\item If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered,
  \field{ioasid_bits} contains the number of bits supported in an I/O
  Address Space ID, the identifier used in most requests. A value of 0 is
  valid, and means that a single address space is supported.

  If the feature is not negotiated, address space identifiers can use up
  to 32 bits.

\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
  \field{input_range} contains the virtual address range that the IOMMU is
  able to translate. Any mapping request to virtual addresses outside of
  this range will fail.

  If the feature is not negotiated, virtual mappings span over the whole
  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
\end{itemize}

\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}

The driver SHOULD set reserved fields of the head and the tail of a
request to zero.

When a device returns a complete request in the used queue without having
written to it, the driver SHOULD interpret it as a failure from the device
to parse the request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the driver SHOULD
NOT send requests with \field{virt_addr} less than
\field{input_range.start} or greater than \field{input_range.end}.

If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered, the driver SHOULD
NOT send requests with \field{address_space} greater than the size
described by \field{ioasid_bits}.

% We mandate truncation to allow a future extension X.Y that would store
% information in addresses and address space IDs.
%
% If device is 0.2 and driver is X.Y, then device ignores ext. bits. But
% if device is X.Y and device is 0.2, then driver *might* set ext. bits to
% garbage. But this extension would be negotiated with a feature bit
% anyway. If it's not, then device must assume that driver is 0.2 and must
% keep truncating the fields.

\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}

The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
didn't succeed. \footnote{%
For IMPLEMENTATION DEFINED values of 'succeed'... For example,
virtio_iommu_req_detach.reserved is allowed to be non-zero. If it is
non-zero, the device may consider it to be a failure and abort the
request. Or it may go on with the detach and return OK.}

If a request \field{type} is not recognized, the device SHOULD return the
buffers on the used ring and set the \field{len} field of the used element
to zero.

The device MUST ignore reserved fields of the head and the tail of a
request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST
truncate the range described by \field{virt_addr} and \field{size} in
requests to fit in the range described by \field{input_range}.

If the VIRTIO_IOMMU_F_IOASID_BITS is offered, the device MUST ignore bits
above \field{ioasid_bits} in field \field{address_space} of requests.

\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}

\begin{lstlisting}
struct virtio_iommu_req_attach {
	le32	address_space;
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Attach an endpoint to an address space. \field{address_space} is an
identifier unique to the virtio-iommu device. If the address space doesn't
exist in the device, it is created. \field{device} is an endpoint
identifier unique to the virtio-iommu device. The host communicates unique
device IDs to the guest using methods outside the scope of this
specification, but the following rules apply:

\begin{itemize}
\item The device ID is unique from the virtio-iommu point of view. Multiple
  endpoints whose DMA transactions are not translated by the same
  virtio-iommu may have the same device ID. Endpoints whose DMA
  transactions may be translated by the same virtio-iommu must have
  different device IDs.

\item Sometimes the host cannot completely isolate two endpoints from each
  others. For example on a legacy PCI bus, endpoints can snoop DMA
  transactions from their neighbours. In this case, the host must
  communicate to the guest that it cannot isolate these endpoints from
  each others, or that the physical IOMMU cannot distinguish transactions
  coming from these endpoints. The method used to communicate this is
  outside the scope of this specification.
\end{itemize}

Multiple endpoints may be added to the same address space. An endpoint
cannot be attached to multiple address spaces in VIRTIO_IOMMU_F_MAP_UNMAP
mode.

\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}

The driver SHOULD set \field{reserved} to zero.

The driver SHOULD ensure that endpoints that cannot be isolated by the
host are attached to the same address space.

\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}

If the \field{reserved} field of an ATTACH request is not zero, the device
SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
NOT attach the endpoint to the address space. \footnote{The device should
validate input of ATTACH requests in case the driver attempts to attach in
a mode that is unimplemented by the device, and would be incompatible with
the modes implemented by the device.}

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If another endpoint is already attached to the address space identified by
\field{address_space}, then the device MAY attach the endpoint identified
by \field{device} to the address space. If it cannot do so, the device
MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.

If the endpoint identified by \field{device} is already attached to
another address space, then the device SHOULD first detach it from that
address space and attach it to the one identified by
\field{address_space}. In that case the device behaves as if the driver
issued a DETACH request with this \field{device}, followed by the ATTACH
request. If the device cannot do so, it MUST set the request
\field{status} to VIRTIO_IOMMU_S_UNSUPP.

\subsubsection{DETACH request}

\begin{lstlisting}
struct virtio_iommu_req_detach {
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Detach an endpoint from its address space. When this request completes,
the endpoint cannot access any mapping from that address space anymore.

After all endpoints have been successfully detached from an address space,
it ceases to exist and its ID can be reused by the driver for another
address space.

\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}

The driver SHOULD set \field{reserved} to zero.

\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}

If the \field{reserved} field of an DETACH request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the DETACH operation.
% If it returns OK, the device SHOULD go on with the detach, as required
% by the VIRTIO_IOMMU_S_OK rule.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the endpoint identified by \field{device} wasn't attached to any
address space, then the device MAY set the request \field{status} to
VIRTIO_IOMMU_S_INVAL.

The device MUST ensure that after being detached from an address space,
the endpoint cannot access any mapping from that address space.

\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}

\begin{lstlisting}
struct virtio_iommu_req_map {
	le32	address_space;
	le64	phys_addr;
	le64	virt_addr;
	le64	size;
	le32	flags;
};

/* Flags are: */
#define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)
#define VIRTIO_IOMMU_MAP_F_EXEC		(1 << 2)
\end{lstlisting}

Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses of the same size. After the request
succeeds, all endpoints attached to this address space can access memory
in the range $[phys\_addr; phys\_addr + size[$. For example, if an endpoint
accesses address $VA \in [virt\_addr; virt\_addr + size[$, the device (or the
physical IOMMU) translates the address: $PA = VA - virt\_addr +
phys\_addr$. If the access parameters are compatible with \field{flags}
(for instance, the access is write and \field{flags} are
VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
the access to reach $PA$.

The range defined by (\field{virt_addr}, \field{size}) must be within the
limits specified by \field{input_range}. The range defined by
(\field{phys_addr}, \field{size}) must be within the guest-physical
address space. This includes upper and lower limits, as well as any
carving of guest-physical addresses for use by the host (for instance MSI
doorbells). Guest physical boundaries are set by the host using a firmware
mechanism outside the scope of this specification.

\begin{note}
This format prevents from creating the identity mapping in a single
request \texttt{[0x0; 0xfff....fff] $\rightarrow$ [0x0; 0xfff...fff]},
since it would result in a size of zero. Hopefully allowing
VIRTIO_IOMMU_F_BYPASS eliminates the need for issuing such request. It
would also be unlikely to conform to the physical range restrictions
from the previous paragraph.
\end{note}

\begin{note}
On flags: it is unlikely that all possible combinations of flags will be
supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$ might
be invalid. We do not have a way to advertise supported and implicit (for
instance $W \rightarrow R$) flags or combination thereof for the moment,
you are free to send any suggestions for describing this. Please keep in
mind that we might soon want to add more flags, such as privileged,
device, transient, shared, etc. (whatever these would mean).
\end{note}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}

The driver SHOULD set undefined \field{flags} bits to zero.

\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}

If \field{virt_addr}, \field{phys_addr} or \field{size} is not aligned on
the page granularity, the device SHOULD set the request \field{status} to
VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.

If the device doesn't recognize a \field{flags} bit, it SHOULD set the
request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
SHOULD NOT create the mapping. \footnote{Validating the input is important
here, because the driver might be attempting to map with special flags
that the device doesn't recognize. Creating the mapping with incompatible
flags may introduce a security hazard.}

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}

\begin{lstlisting}
struct virtio_iommu_req_unmap {
	le32	address_space;
	le64	virt_addr;
	le64	size;
	le32	reserved;
};
\end{lstlisting}

Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
a mapping as a virtual region created with a single MAP request. All
mappings covered by the range $[virt\_addr; virt\_addr + size [$ are
removed.

The semantics of unmapping are specified below, and illustrated with the
following requests, assuming each example sequence starts with a blank
address space. We define two pseudocode functions \texttt{map(virt\_addr,
size) -> mapping} and \texttt{unmap(virt\_addr, size)}.

\begin{lstlisting}
(1) unmap(addr=0, size=5)        -> succeeds, doesn't unmap anything

(2) a = map(addr=0, size=10);
    unmap(0, 10)                 -> succeeds, unmaps a

(3) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 10)                 -> succeeds, unmaps a and b

(4) a = map(0, 10);
    unmap(0, 5)                  -> faults, doesn't unmap anything

(5) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 5)                  -> succeeds, unmaps a

(6) a = map(0, 5);
    unmap(0, 10)                 -> succeeds, unmaps a

(7) a = map(0, 5);
    b = map(10, 5);
    unmap(0, 15)                 -> succeeds, unmaps a and b
\end{lstlisting}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}

The driver SHOULD set the \field{reserved} field to zero.

The range, defined by \field{virt_addr} and \field{size}, SHOULD cover one
or more contiguous mappings created with MAP requests. The range MAY spill
over unmapped virtual addresses.

The first address of a range SHOULD either be the first address of a
mapping or be outside any mapping. The last address of a range SHOULD
either be the last address of a mapping or be outside any mapping.

\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}

If the \field{reserved} field of an UNMAP request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the UNMAP operation.
% If it returns OK, the device SHOULD go on with the unmap, as required by
% the VIRTIO_IOMMU_S_OK rule.

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

If a mapping affected by the range is not covered in its entirety by the
range (the UNMAP request would split the mapping), then the device SHOULD
set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
remove any mapping.

If part of the range or the full range is not covered by an existing
mapping, then the device SHOULD remove all mappings affected by the range
and set the request \field{status} to VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}

If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
device manages. This probe is performed before attaching the endpoint to
an address space.

\begin{lstlisting}
struct virtio_iommu_req_probe {
	/* Device-readable */
	le32	device;
	le32	flags;
	u8	reserved[60];

	/* Device-writable when not ACK */
	u8	properties[];
};

/* Flags are: */
#define VIRTIO_IOMMU_PROBE_F_ACK	(1 << 0)
\end{lstlisting}

\begin{description}
\item[\field{device}] has the same meaning as in ATTACH and DETACH
  requests.

\item[\field{flags}] contain additional information about the request.
  The VIRTIO_IOMMU_PROBE_F_ACK flag changes the descriptor chain layout:
  when ACK is clear, the \field{properties} field is device-writable;
  when it is set, the \field{properties} field is device-readable.

\item[\field{reserved}] is used as padding, so that future extensions can
  add fields to the device-readable part.

\item[\field{properties}] contains a list of properties of endpoint
  \field{device}, filled by the device. This field is exactly
  \field{probe_size} bytes. Each property is described with a type, four
    flag bits, a length, and a value:
\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK	0xfff
#define VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK	(1 << 12)

struct virtio_iommu_probe_property {
	le16	type;
	le16	length;
	u8	value[];
};
\end{lstlisting}

\end{description}

The driver allocates a buffer of adequate size for the probe request,
writes \field{device} and adds it to the request queue. The device fills
the \field{properties} field with a list of properties for this endpoint.

The driver parses the first property by reading \field{type}, then
\field{length}. If the driver recognizes \field{type}, it reads and
handles \field{value}. The driver then reads the next property, that is
located $(\field{length} + 4)$ bytes after the beginning of the first one,
and so on. The driver parses all properties until it reaches a NONE
property or the end of \field{properties}.

The upper nibble of property \field{type} is reserved for flags.
Therefore only 4096 types are available. The actual type of a property is
extracted like this:

\begin{lstlisting}
u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK;
\end{lstlisting}

If a property is correctly understood by the driver, then it sets the ACK
bit in \field{type}:

\begin{lstlisting}
property.type |= cpu_to_le16(VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK);
\end{lstlisting}

Then, to signal to the device which properties are understood, the device
sends the probe again with the VIRTIO_IOMMU_PROBE_F_ACK flag. In all
properties understood and accepted by the driver, \field{type} has the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit set. The other properties are left
as is.

This second phase of the probe request allows the device to ensure that
all properties crucial for good operations are recognized and handled by
the driver. This is analogous to the initial feature negotiation of virtio
devices: an endpoint property is \emph{offered} by the device to the
driver during the first PROBE, and it is \emph{negotiated} after the
driver acknowledges it during the second PROBE.

Available property types are described in section
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
When attaching multiple devices to the same address space, their
properties are combined. \emph{Combination Rules} are given for each
property, and describe the rules to apply when combining properties
obtained during probe.

\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}

The size of \field{properties} MUST be \field{probe_size} bytes.

The driver SHOULD set undefined \field{flags} to zero.

The driver SHOULD set \field{reserved} to zero.

If the driver doesn't recognize the \field{type} of a property, it SHOULD
ignore the property and continue parsing the list.

The driver SHOULD NOT deduce the property length from \field{type}.

If the driver recognizes a property \field{type} and is able to
handle{\footnotemark} the property, then the driver SHOULD set the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit of that property.

\footnotetext{A driver's ability to handle a property depends on the
property type. Without a specific definition of the ACK requirements for a
given property type, it simply means that the driver read all fields of
that property.}

The driver SHOULD resend the PROBE request with the
VIRTIO_IOMMU_PROBE_F_ACK bit set after parsing and updating the
\field{properties} list. Depending on the properties encountered in the
list, the driver MAY modify some of their fields between the first and
second probe, but it SHOULD NOT modify the \field{length} field or bits
[11:0] of field \field{type}.

\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}

If an undefined bit is set in \field{flags}, the device MAY set the
request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the \field{reserved} field of a PROBE request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
the buffers on the used ring and set the \field{len} field of the used
element to zero.

The device SHOULD set bits [15:13] of property \field{type} to zero.

The device MUST write the size of \field{value}, in bytes, into
\field{length}.

When two properties follow each others, the device MUST put the second
property exactly $(\field{length} + 4)$ bytes after the beginning of the
first one.

If the device doesn't fill all \field{probe_size} bytes with properties,
it SHOULD terminate the list with a property of type NONE and size 0. The
device MAY fill the remaining bytes of \field{properties}, if any, with
zeroes. If there isn't enough space remaining in \field{properties} to
terminate the list with a complete NONE property (4 bytes), then the
device SHOULD fill the remaining bytes with zeroes.

If the PROBE request has VIRTIO_IOMMU_PROBE_F_ACK bit set, the device MAY
ignore the request and set the request \field{status} to
VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}

\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_T_NONE		0
#define VIRTIO_IOMMU_PROBE_T_RESV_MEM		2
\end{lstlisting}

\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}

Marks the end of the property list. This property doesn't have any value,
and should have \field{length} 0.

\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The RESV_MEM property describes a chunk of reserved virtual memory. It may
be used by the device to describe virtual address ranges that shouldn't be
allocated by the driver, or that are special.

\begin{lstlisting}
struct virtio_iommu_probe_resv_mem {
	u8	subtype;
	u8	reserved[3];
	le64	addr;
	le64	size;
	le32	flags;
};
\end{lstlisting}

Fields \field{addr} and \field{size} describe the range of reserved
addresses. \field{subtype} may be one of:

\begin{description}
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT (0)]
    Accesses to this region are aborted. This subtype does not accept any
    flag.
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (1)]
    Accesses to this region behave as if the IOMMU was bypassed, and reach
    the bus upstream of the IOMMU untranslated.

    The following \field{flags} are defined for BYPASS regions:
    \begin{description}
      \item[VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI (1)]
        Provides a hint to the guest that this is a doorbell for Message
        Signaled Interrupts.

        If the device doesn't provide such a region, then MSIs are normal
        write accesses from the IOMMU point of view, and arbitrary virtual
        addresses should be allocated by the driver to map MSI doorbells.
        Otherwise, the guest should use the guest-physical doorbell
        address when programming MSIs for this endpoint.
    \end{description}

  %\item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_IDENTITY (2)]
  %  This region should be identity-mapped by the guest. TODO: is this
  %  useful for anyone?
\end{description}

\propcombination{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

Multiple overlapping RESV_MEM properties are merged together. Difference
in subtype on the intersecting range doesn't make a difference from the
driver point of view.

\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The driver SHOULD NOT map any virtual address described by a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT or
VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS property.

% An old driver that doesn't find or understand this property will
% allocate and map virtual addresses. We really can't do anything about
% that. We're not introducing a regression, MSIs never worked for x86
% before we introduced the F_MSI flag.

The driver SHOULD ignore \field{reserved}.

For a given \field{subtype}, the driver SHOULD ignore undefined
\field{flags} bits.

The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
was VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT.

\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The device SHOULD set \field{reserved} to zero.

For a given \field{subtype}, the device SHOULD set undefined \field{flags}
bits to zero.

The device MAY abort any transaction targeting a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT region.

If an endpoint is attached to an address space, the device SHOULD leave
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
regions pass through untranslated. In other words, the device SHOULD
handle such a region as if it was identity-mapped (virtual address equal
to physical address). If the endpoint is not attached to any address
space, then the device MAY abort the transaction.

The device MAY abort any transaction that isn't a write access and that
targets a VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS region with flag
VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC] virtio-iommu v0.4 - IOMMU Device
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
  (?)
@ 2017-08-04 18:19 ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

The following is roughly the content of device-operations.tex

---
\section{IOMMU device}\label{sec:Device Types / IOMMU Device}

The virtio-iommu device manages Direct Memory Access (DMA) from one or
more endpoints. It may act as a proxy for multiple physical IOMMUs
managing devices assigned to the guest, and as standalone IOMMU for
virtual devices.

The driver first discovers endpoints managed by the virtio-iommu device
using standard firmware mechanisms. It then sends requests to create
virtual address spaces and virtual-to-physical mappings for these
endpoints. In its simplest form, the virtio-iommu supports four request
types:

\begin{enumerate}
\item Create an address space and attach an endpoint to it.  \\
  \texttt{attach(device = 0x104, address space = 1)}
\item Create a mapping between a range of guest-virtual and guest-physical
  address. \\
  \texttt{map(address space = 1, virt = 0x1000, phys = 0xa000,
          size = 0x1000, flags = READ)}

  Endpoint 0x104, for example a hardware PCI endpoint, can now read at
  addresses 0x1000-0x1fff. These accesses are translated into
  system-physical addresses by the IOMMU.

\item Remove the mapping.\\
  \texttt{unmap(address space = 1, virt = 0x1000, size = 0x1000)}

  Any access to addresses 0x1000-0x1fff by endpoint 0x104 would now be
  rejected.
\item Detach the device and remove the address space.\\
  \texttt{detach(device = 0x104)}
\end{enumerate}

\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}

TBD. During development, use 61216.

\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}

\begin{description}
\item[0] requestq
\end{description}

\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}

\begin{description}
\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
  Available range of virtual addresses is described in \field{input_range}

\item[VIRTIO_IOMMU_F_IOASID_BITS (1)]
  The number of address spaces supported is described in \field{ioasid_bits}

\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
  Map and unmap requests are available.\footnote{Future extensions may add
  different modes of operations. At the moment, only
  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}

\item[VIRTIO_IOMMU_F_BYPASS (3)]
  When not attached to an address space, endpoints downstream of the IOMMU
  can access the guest-physical address space.

\item[VIRTIO_IOMMU_F_PROBE (4)]
  Probe request is available.
\end{description}

\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}

The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.

% XXX F_MAP_UNMAP will be optional when introducing PTH. But a 0.2 driver
% must implement it (otherwise the device is useless)

\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}

If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS or VIRTIO_IOMMU_F_PROBE feature bits, and if
the driver did not accept this feature bit, then the device MAY signal
failure by failing to set FEATURES_OK \field{device status} bit when the
driver writes it.

If the device offers the VIRTIO_IOMMU_F_MAP_UNMAP feature bit, and if the
driver did not accept this feature bit, then the device SHOULD behave as
if the feature was negotiated.
% This takes into account all the following "If the F_MAP_UNMAP feature
% was negotiated..."

% If the driver supports F_PTH but not F_MAP_UNMAP, then the driver MUST
% give up upon seeing that the 0.2 device doesn't support F_PTH. If it
% supports F_MAP_UNMAP, then it SHOULD use F_MAP_UNMAP (this is described
% in "Driver Requirements: Feature Bits")

% If the driver supports F_PTH but the device doesn't, and the driver
% stupidly sets F_PTH but not F_MAP_UNMAP, then the device SHOULD reject
% any PTH request and PTH flag in attach.

% When not using the legacy interface, if the driver doesn't negotiate
% F_MAP_UNMAP, then the device may disable it an reject any request.

\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits / Legacy Interface: Feature bits}

When using the legacy interface, transitional devices MUST support guests
which do not negotiate any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP or
VIRTIO_IOMMU_F_PROBE features, and MUST behave as if the feature was
negotiated.

%\subsection{Feature Bits Requirements}\label{sec:Device Types / IOMMU Device / Feature bits requirements}

\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}

The \field{page_size_mask} field is always present. Availability of the
others depend on various feature bits as indicated above.

\begin{lstlisting}
struct virtio_iommu_config {
	u64 page_size_mask;
	struct virtio_iommu_range {
		u64 start;
		u64 end;
	} input_range;
	u8 ioasid_bits;
	u8 padding[3];
	u32 probe_size;
};
\end{lstlisting}

\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}

The driver MUST NOT write to device configuration fields.

\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}

The device SHOULD set \field{padding} to zero.

The device MUST set at least one bit in \field{page_size_mask}, describing
the page granularity. The device MAY set more than one bit in
\field{page_size_mask}.

\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout / Legacy Interface: Device configuration layout}
When using the legacy interface, transitional devices and drivers
MUST format the fields in struct virtio_iommu_config
according to the native endian of the guest rather than
(necessarily when not using the legacy interface) little-endian.


\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}

When the device is reset, endpoints are not attached to any address space.
If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
access guest-physical addresses ("bypass mode"). If the feature is not
negotiated, then any memory access from endpoints will fault. Upon
attaching an endpoint in bypass mode to a new address space, any memory
access from the endpoint will fault, since the address space does not
contain any mapping.

The driver chooses operating mode depending on its capabilities. In this
revision of the virtio-iommu specification, the only supported mode is
VIRTIO_IOMMU_F_MAP_UNMAP.

\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}

The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.

If the VIRTIO_IOMMU_F_PROBE feature is offered, the driver SHOULD send a
VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
endpoint to an address space.

\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}

If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
device SHOULD NOT let endpoints access the guest-physical address space.
However, the device MAY let endpoints access memory regions negotiated
with VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (see
\ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}).

\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}

Driver send requests on the request virtqueue, notifies the device and
waits for the device to return the request with a status in the used ring.
All requests are split in two parts: one device-readable, one device-
writable. Each request is therefore described with at least two
descriptors, as illustrated below.

\begin{figure}[htb]
  \centering
  \includegraphics[width=0.7\textwidth]{img/request-wrapping.png}
  \caption{Anatomy of a virtio-iommu request}
\end{figure}

\begin{lstlisting}
struct virtio_iommu_req_head {
	u8	type;
	u8	reserved[3];
};

struct virtio_iommu_req_tail {
	u8	status;
	u8	reserved[3];
};
\end{lstlisting}

\rfc{% TODO
There is a problem with framing using multiple chains... If a request
isn't recognized by the device and is scattered across multiple descriptor
chains, how would the device know where this request ends and where the
next one begins? It assumes that a transition from WO descriptors to RO
descriptors is a new request, ok. But if this unknown request is from a
future extension that uses interleaved RO, WO, RO, WO descriptors for a
single request, then the device is doomed.\\
The extension will have to force that third buffer to start with 0 so our
device, that thinks it's a new request, doesn't recognize the request type
and waits for the next one.\\
Alternatively, we could add a 16-bit 'size' field into
virtio_iommu_req_head. We'd loose some space for future flags, and force
requests to be at most 64k, or 256k if we drop size bits [1:0]. (head
couldn't be extended in the future, with a field to describe a bigger
size, because our legacy device here would ignore it.) We'd have to resort
to multiple "container" requests transporting a single big one.\\
Personally I think we can let future extensions deal with interleaved
RO/WO/RO descriptors, but do you think a 'size' field would be better?
}

Type may be one of:

\begin{lstlisting}
#define VIRTIO_IOMMU_T_ATTACH			1
#define VIRTIO_IOMMU_T_DETACH			2
#define VIRTIO_IOMMU_T_MAP			3
#define VIRTIO_IOMMU_T_UNMAP			4
#define VIRTIO_IOMMU_T_PROBE			5
\end{lstlisting}

A few general-purpose status codes are defined here. Unless explicitly
described in a \textbf{Requirements} section, these values are hints to
make troubleshooting easier.

When the device fails to parse a request, for instance if a request seems
too small for its type and the device cannot find the tail, then it will
be unable to set \field{status}. In that case, it should return the
buffers without writing in them.

\begin{lstlisting}
/* All good! Carry on. */
#define VIRTIO_IOMMU_S_OK			0
/* Virtio communication error */
#define VIRTIO_IOMMU_S_IOERR			1
/* Unsupported request */
#define VIRTIO_IOMMU_S_UNSUPP			2
/* Internal device error */
#define VIRTIO_IOMMU_S_DEVERR			3
/* Invalid parameters */
#define VIRTIO_IOMMU_S_INVAL			4
/* Out-of-range parameters */
#define VIRTIO_IOMMU_S_RANGE			5
/* Entry not found */
#define VIRTIO_IOMMU_S_NOENT			6
/* Bad address */
#define VIRTIO_IOMMU_S_FAULT			7
\end{lstlisting}

Range limits of some request fields are described in the device
configuration:

\begin{itemize}
\item \field{page_size_mask} contains the bitmask of all page sizes that
  can be mapped. The least significant bit set defines the page
  granularity of IOMMU mappings. Other bits in the mask are hints
  describing page sizes that the IOMMU can merge into a single mapping
  (page blocks).

  The smallest page granularity supported by the IOMMU is one byte. It is
  legal for the driver to map one byte at a time if bit 0 of
  \field{page_size_mask} is set.

\item If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered,
  \field{ioasid_bits} contains the number of bits supported in an I/O
  Address Space ID, the identifier used in most requests. A value of 0 is
  valid, and means that a single address space is supported.

  If the feature is not negotiated, address space identifiers can use up
  to 32 bits.

\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
  \field{input_range} contains the virtual address range that the IOMMU is
  able to translate. Any mapping request to virtual addresses outside of
  this range will fail.

  If the feature is not negotiated, virtual mappings span over the whole
  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
\end{itemize}

\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}

The driver SHOULD set reserved fields of the head and the tail of a
request to zero.

When a device returns a complete request in the used queue without having
written to it, the driver SHOULD interpret it as a failure from the device
to parse the request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the driver SHOULD
NOT send requests with \field{virt_addr} less than
\field{input_range.start} or greater than \field{input_range.end}.

If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered, the driver SHOULD
NOT send requests with \field{address_space} greater than the size
described by \field{ioasid_bits}.

% We mandate truncation to allow a future extension X.Y that would store
% information in addresses and address space IDs.
%
% If device is 0.2 and driver is X.Y, then device ignores ext. bits. But
% if device is X.Y and device is 0.2, then driver *might* set ext. bits to
% garbage. But this extension would be negotiated with a feature bit
% anyway. If it's not, then device must assume that driver is 0.2 and must
% keep truncating the fields.

\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}

The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
didn't succeed. \footnote{%
For IMPLEMENTATION DEFINED values of 'succeed'... For example,
virtio_iommu_req_detach.reserved is allowed to be non-zero. If it is
non-zero, the device may consider it to be a failure and abort the
request. Or it may go on with the detach and return OK.}

If a request \field{type} is not recognized, the device SHOULD return the
buffers on the used ring and set the \field{len} field of the used element
to zero.

The device MUST ignore reserved fields of the head and the tail of a
request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST
truncate the range described by \field{virt_addr} and \field{size} in
requests to fit in the range described by \field{input_range}.

If the VIRTIO_IOMMU_F_IOASID_BITS is offered, the device MUST ignore bits
above \field{ioasid_bits} in field \field{address_space} of requests.

\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}

\begin{lstlisting}
struct virtio_iommu_req_attach {
	le32	address_space;
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Attach an endpoint to an address space. \field{address_space} is an
identifier unique to the virtio-iommu device. If the address space doesn't
exist in the device, it is created. \field{device} is an endpoint
identifier unique to the virtio-iommu device. The host communicates unique
device IDs to the guest using methods outside the scope of this
specification, but the following rules apply:

\begin{itemize}
\item The device ID is unique from the virtio-iommu point of view. Multiple
  endpoints whose DMA transactions are not translated by the same
  virtio-iommu may have the same device ID. Endpoints whose DMA
  transactions may be translated by the same virtio-iommu must have
  different device IDs.

\item Sometimes the host cannot completely isolate two endpoints from each
  others. For example on a legacy PCI bus, endpoints can snoop DMA
  transactions from their neighbours. In this case, the host must
  communicate to the guest that it cannot isolate these endpoints from
  each others, or that the physical IOMMU cannot distinguish transactions
  coming from these endpoints. The method used to communicate this is
  outside the scope of this specification.
\end{itemize}

Multiple endpoints may be added to the same address space. An endpoint
cannot be attached to multiple address spaces in VIRTIO_IOMMU_F_MAP_UNMAP
mode.

\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}

The driver SHOULD set \field{reserved} to zero.

The driver SHOULD ensure that endpoints that cannot be isolated by the
host are attached to the same address space.

\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}

If the \field{reserved} field of an ATTACH request is not zero, the device
SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
NOT attach the endpoint to the address space. \footnote{The device should
validate input of ATTACH requests in case the driver attempts to attach in
a mode that is unimplemented by the device, and would be incompatible with
the modes implemented by the device.}

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If another endpoint is already attached to the address space identified by
\field{address_space}, then the device MAY attach the endpoint identified
by \field{device} to the address space. If it cannot do so, the device
MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.

If the endpoint identified by \field{device} is already attached to
another address space, then the device SHOULD first detach it from that
address space and attach it to the one identified by
\field{address_space}. In that case the device behaves as if the driver
issued a DETACH request with this \field{device}, followed by the ATTACH
request. If the device cannot do so, it MUST set the request
\field{status} to VIRTIO_IOMMU_S_UNSUPP.

\subsubsection{DETACH request}

\begin{lstlisting}
struct virtio_iommu_req_detach {
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Detach an endpoint from its address space. When this request completes,
the endpoint cannot access any mapping from that address space anymore.

After all endpoints have been successfully detached from an address space,
it ceases to exist and its ID can be reused by the driver for another
address space.

\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}

The driver SHOULD set \field{reserved} to zero.

\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}

If the \field{reserved} field of an DETACH request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the DETACH operation.
% If it returns OK, the device SHOULD go on with the detach, as required
% by the VIRTIO_IOMMU_S_OK rule.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the endpoint identified by \field{device} wasn't attached to any
address space, then the device MAY set the request \field{status} to
VIRTIO_IOMMU_S_INVAL.

The device MUST ensure that after being detached from an address space,
the endpoint cannot access any mapping from that address space.

\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}

\begin{lstlisting}
struct virtio_iommu_req_map {
	le32	address_space;
	le64	phys_addr;
	le64	virt_addr;
	le64	size;
	le32	flags;
};

/* Flags are: */
#define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)
#define VIRTIO_IOMMU_MAP_F_EXEC		(1 << 2)
\end{lstlisting}

Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses of the same size. After the request
succeeds, all endpoints attached to this address space can access memory
in the range $[phys\_addr; phys\_addr + size[$. For example, if an endpoint
accesses address $VA \in [virt\_addr; virt\_addr + size[$, the device (or the
physical IOMMU) translates the address: $PA = VA - virt\_addr +
phys\_addr$. If the access parameters are compatible with \field{flags}
(for instance, the access is write and \field{flags} are
VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
the access to reach $PA$.

The range defined by (\field{virt_addr}, \field{size}) must be within the
limits specified by \field{input_range}. The range defined by
(\field{phys_addr}, \field{size}) must be within the guest-physical
address space. This includes upper and lower limits, as well as any
carving of guest-physical addresses for use by the host (for instance MSI
doorbells). Guest physical boundaries are set by the host using a firmware
mechanism outside the scope of this specification.

\begin{note}
This format prevents from creating the identity mapping in a single
request \texttt{[0x0; 0xfff....fff] $\rightarrow$ [0x0; 0xfff...fff]},
since it would result in a size of zero. Hopefully allowing
VIRTIO_IOMMU_F_BYPASS eliminates the need for issuing such request. It
would also be unlikely to conform to the physical range restrictions
from the previous paragraph.
\end{note}

\begin{note}
On flags: it is unlikely that all possible combinations of flags will be
supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$ might
be invalid. We do not have a way to advertise supported and implicit (for
instance $W \rightarrow R$) flags or combination thereof for the moment,
you are free to send any suggestions for describing this. Please keep in
mind that we might soon want to add more flags, such as privileged,
device, transient, shared, etc. (whatever these would mean).
\end{note}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}

The driver SHOULD set undefined \field{flags} bits to zero.

\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}

If \field{virt_addr}, \field{phys_addr} or \field{size} is not aligned on
the page granularity, the device SHOULD set the request \field{status} to
VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.

If the device doesn't recognize a \field{flags} bit, it SHOULD set the
request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
SHOULD NOT create the mapping. \footnote{Validating the input is important
here, because the driver might be attempting to map with special flags
that the device doesn't recognize. Creating the mapping with incompatible
flags may introduce a security hazard.}

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}

\begin{lstlisting}
struct virtio_iommu_req_unmap {
	le32	address_space;
	le64	virt_addr;
	le64	size;
	le32	reserved;
};
\end{lstlisting}

Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
a mapping as a virtual region created with a single MAP request. All
mappings covered by the range $[virt\_addr; virt\_addr + size [$ are
removed.

The semantics of unmapping are specified below, and illustrated with the
following requests, assuming each example sequence starts with a blank
address space. We define two pseudocode functions \texttt{map(virt\_addr,
size) -> mapping} and \texttt{unmap(virt\_addr, size)}.

\begin{lstlisting}
(1) unmap(addr=0, size=5)        -> succeeds, doesn't unmap anything

(2) a = map(addr=0, size=10);
    unmap(0, 10)                 -> succeeds, unmaps a

(3) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 10)                 -> succeeds, unmaps a and b

(4) a = map(0, 10);
    unmap(0, 5)                  -> faults, doesn't unmap anything

(5) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 5)                  -> succeeds, unmaps a

(6) a = map(0, 5);
    unmap(0, 10)                 -> succeeds, unmaps a

(7) a = map(0, 5);
    b = map(10, 5);
    unmap(0, 15)                 -> succeeds, unmaps a and b
\end{lstlisting}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}

The driver SHOULD set the \field{reserved} field to zero.

The range, defined by \field{virt_addr} and \field{size}, SHOULD cover one
or more contiguous mappings created with MAP requests. The range MAY spill
over unmapped virtual addresses.

The first address of a range SHOULD either be the first address of a
mapping or be outside any mapping. The last address of a range SHOULD
either be the last address of a mapping or be outside any mapping.

\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}

If the \field{reserved} field of an UNMAP request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the UNMAP operation.
% If it returns OK, the device SHOULD go on with the unmap, as required by
% the VIRTIO_IOMMU_S_OK rule.

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

If a mapping affected by the range is not covered in its entirety by the
range (the UNMAP request would split the mapping), then the device SHOULD
set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
remove any mapping.

If part of the range or the full range is not covered by an existing
mapping, then the device SHOULD remove all mappings affected by the range
and set the request \field{status} to VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}

If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
device manages. This probe is performed before attaching the endpoint to
an address space.

\begin{lstlisting}
struct virtio_iommu_req_probe {
	/* Device-readable */
	le32	device;
	le32	flags;
	u8	reserved[60];

	/* Device-writable when not ACK */
	u8	properties[];
};

/* Flags are: */
#define VIRTIO_IOMMU_PROBE_F_ACK	(1 << 0)
\end{lstlisting}

\begin{description}
\item[\field{device}] has the same meaning as in ATTACH and DETACH
  requests.

\item[\field{flags}] contain additional information about the request.
  The VIRTIO_IOMMU_PROBE_F_ACK flag changes the descriptor chain layout:
  when ACK is clear, the \field{properties} field is device-writable;
  when it is set, the \field{properties} field is device-readable.

\item[\field{reserved}] is used as padding, so that future extensions can
  add fields to the device-readable part.

\item[\field{properties}] contains a list of properties of endpoint
  \field{device}, filled by the device. This field is exactly
  \field{probe_size} bytes. Each property is described with a type, four
    flag bits, a length, and a value:
\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK	0xfff
#define VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK	(1 << 12)

struct virtio_iommu_probe_property {
	le16	type;
	le16	length;
	u8	value[];
};
\end{lstlisting}

\end{description}

The driver allocates a buffer of adequate size for the probe request,
writes \field{device} and adds it to the request queue. The device fills
the \field{properties} field with a list of properties for this endpoint.

The driver parses the first property by reading \field{type}, then
\field{length}. If the driver recognizes \field{type}, it reads and
handles \field{value}. The driver then reads the next property, that is
located $(\field{length} + 4)$ bytes after the beginning of the first one,
and so on. The driver parses all properties until it reaches a NONE
property or the end of \field{properties}.

The upper nibble of property \field{type} is reserved for flags.
Therefore only 4096 types are available. The actual type of a property is
extracted like this:

\begin{lstlisting}
u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK;
\end{lstlisting}

If a property is correctly understood by the driver, then it sets the ACK
bit in \field{type}:

\begin{lstlisting}
property.type |= cpu_to_le16(VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK);
\end{lstlisting}

Then, to signal to the device which properties are understood, the device
sends the probe again with the VIRTIO_IOMMU_PROBE_F_ACK flag. In all
properties understood and accepted by the driver, \field{type} has the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit set. The other properties are left
as is.

This second phase of the probe request allows the device to ensure that
all properties crucial for good operations are recognized and handled by
the driver. This is analogous to the initial feature negotiation of virtio
devices: an endpoint property is \emph{offered} by the device to the
driver during the first PROBE, and it is \emph{negotiated} after the
driver acknowledges it during the second PROBE.

Available property types are described in section
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
When attaching multiple devices to the same address space, their
properties are combined. \emph{Combination Rules} are given for each
property, and describe the rules to apply when combining properties
obtained during probe.

\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}

The size of \field{properties} MUST be \field{probe_size} bytes.

The driver SHOULD set undefined \field{flags} to zero.

The driver SHOULD set \field{reserved} to zero.

If the driver doesn't recognize the \field{type} of a property, it SHOULD
ignore the property and continue parsing the list.

The driver SHOULD NOT deduce the property length from \field{type}.

If the driver recognizes a property \field{type} and is able to
handle{\footnotemark} the property, then the driver SHOULD set the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit of that property.

\footnotetext{A driver's ability to handle a property depends on the
property type. Without a specific definition of the ACK requirements for a
given property type, it simply means that the driver read all fields of
that property.}

The driver SHOULD resend the PROBE request with the
VIRTIO_IOMMU_PROBE_F_ACK bit set after parsing and updating the
\field{properties} list. Depending on the properties encountered in the
list, the driver MAY modify some of their fields between the first and
second probe, but it SHOULD NOT modify the \field{length} field or bits
[11:0] of field \field{type}.

\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}

If an undefined bit is set in \field{flags}, the device MAY set the
request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the \field{reserved} field of a PROBE request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
the buffers on the used ring and set the \field{len} field of the used
element to zero.

The device SHOULD set bits [15:13] of property \field{type} to zero.

The device MUST write the size of \field{value}, in bytes, into
\field{length}.

When two properties follow each others, the device MUST put the second
property exactly $(\field{length} + 4)$ bytes after the beginning of the
first one.

If the device doesn't fill all \field{probe_size} bytes with properties,
it SHOULD terminate the list with a property of type NONE and size 0. The
device MAY fill the remaining bytes of \field{properties}, if any, with
zeroes. If there isn't enough space remaining in \field{properties} to
terminate the list with a complete NONE property (4 bytes), then the
device SHOULD fill the remaining bytes with zeroes.

If the PROBE request has VIRTIO_IOMMU_PROBE_F_ACK bit set, the device MAY
ignore the request and set the request \field{status} to
VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}

\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_T_NONE		0
#define VIRTIO_IOMMU_PROBE_T_RESV_MEM		2
\end{lstlisting}

\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}

Marks the end of the property list. This property doesn't have any value,
and should have \field{length} 0.

\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The RESV_MEM property describes a chunk of reserved virtual memory. It may
be used by the device to describe virtual address ranges that shouldn't be
allocated by the driver, or that are special.

\begin{lstlisting}
struct virtio_iommu_probe_resv_mem {
	u8	subtype;
	u8	reserved[3];
	le64	addr;
	le64	size;
	le32	flags;
};
\end{lstlisting}

Fields \field{addr} and \field{size} describe the range of reserved
addresses. \field{subtype} may be one of:

\begin{description}
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT (0)]
    Accesses to this region are aborted. This subtype does not accept any
    flag.
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (1)]
    Accesses to this region behave as if the IOMMU was bypassed, and reach
    the bus upstream of the IOMMU untranslated.

    The following \field{flags} are defined for BYPASS regions:
    \begin{description}
      \item[VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI (1)]
        Provides a hint to the guest that this is a doorbell for Message
        Signaled Interrupts.

        If the device doesn't provide such a region, then MSIs are normal
        write accesses from the IOMMU point of view, and arbitrary virtual
        addresses should be allocated by the driver to map MSI doorbells.
        Otherwise, the guest should use the guest-physical doorbell
        address when programming MSIs for this endpoint.
    \end{description}

  %\item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_IDENTITY (2)]
  %  This region should be identity-mapped by the guest. TODO: is this
  %  useful for anyone?
\end{description}

\propcombination{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

Multiple overlapping RESV_MEM properties are merged together. Difference
in subtype on the intersecting range doesn't make a difference from the
driver point of view.

\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The driver SHOULD NOT map any virtual address described by a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT or
VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS property.

% An old driver that doesn't find or understand this property will
% allocate and map virtual addresses. We really can't do anything about
% that. We're not introducing a regression, MSIs never worked for x86
% before we introduced the F_MSI flag.

The driver SHOULD ignore \field{reserved}.

For a given \field{subtype}, the driver SHOULD ignore undefined
\field{flags} bits.

The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
was VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT.

\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The device SHOULD set \field{reserved} to zero.

For a given \field{subtype}, the device SHOULD set undefined \field{flags}
bits to zero.

The device MAY abort any transaction targeting a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT region.

If an endpoint is attached to an address space, the device SHOULD leave
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
regions pass through untranslated. In other words, the device SHOULD
handle such a region as if it was identity-mapped (virtual address equal
to physical address). If the endpoint is not attached to any address
space, then the device MAY abort the transaction.

The device MAY abort any transaction that isn't a write access and that
targets a VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS region with flag
VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] [RFC] virtio-iommu v0.4 - IOMMU Device
@ 2017-08-04 18:19   ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

The following is roughly the content of device-operations.tex

---
\section{IOMMU device}\label{sec:Device Types / IOMMU Device}

The virtio-iommu device manages Direct Memory Access (DMA) from one or
more endpoints. It may act as a proxy for multiple physical IOMMUs
managing devices assigned to the guest, and as standalone IOMMU for
virtual devices.

The driver first discovers endpoints managed by the virtio-iommu device
using standard firmware mechanisms. It then sends requests to create
virtual address spaces and virtual-to-physical mappings for these
endpoints. In its simplest form, the virtio-iommu supports four request
types:

\begin{enumerate}
\item Create an address space and attach an endpoint to it.  \\
  \texttt{attach(device = 0x104, address space = 1)}
\item Create a mapping between a range of guest-virtual and guest-physical
  address. \\
  \texttt{map(address space = 1, virt = 0x1000, phys = 0xa000,
          size = 0x1000, flags = READ)}

  Endpoint 0x104, for example a hardware PCI endpoint, can now read at
  addresses 0x1000-0x1fff. These accesses are translated into
  system-physical addresses by the IOMMU.

\item Remove the mapping.\\
  \texttt{unmap(address space = 1, virt = 0x1000, size = 0x1000)}

  Any access to addresses 0x1000-0x1fff by endpoint 0x104 would now be
  rejected.
\item Detach the device and remove the address space.\\
  \texttt{detach(device = 0x104)}
\end{enumerate}

\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}

TBD. During development, use 61216.

\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}

\begin{description}
\item[0] requestq
\end{description}

\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}

\begin{description}
\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
  Available range of virtual addresses is described in \field{input_range}

\item[VIRTIO_IOMMU_F_IOASID_BITS (1)]
  The number of address spaces supported is described in \field{ioasid_bits}

\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
  Map and unmap requests are available.\footnote{Future extensions may add
  different modes of operations. At the moment, only
  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}

\item[VIRTIO_IOMMU_F_BYPASS (3)]
  When not attached to an address space, endpoints downstream of the IOMMU
  can access the guest-physical address space.

\item[VIRTIO_IOMMU_F_PROBE (4)]
  Probe request is available.
\end{description}

\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}

The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.

% XXX F_MAP_UNMAP will be optional when introducing PTH. But a 0.2 driver
% must implement it (otherwise the device is useless)

\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits}

If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS or VIRTIO_IOMMU_F_PROBE feature bits, and if
the driver did not accept this feature bit, then the device MAY signal
failure by failing to set FEATURES_OK \field{device status} bit when the
driver writes it.

If the device offers the VIRTIO_IOMMU_F_MAP_UNMAP feature bit, and if the
driver did not accept this feature bit, then the device SHOULD behave as
if the feature was negotiated.
% This takes into account all the following "If the F_MAP_UNMAP feature
% was negotiated..."

% If the driver supports F_PTH but not F_MAP_UNMAP, then the driver MUST
% give up upon seeing that the 0.2 device doesn't support F_PTH. If it
% supports F_MAP_UNMAP, then it SHOULD use F_MAP_UNMAP (this is described
% in "Driver Requirements: Feature Bits")

% If the driver supports F_PTH but the device doesn't, and the driver
% stupidly sets F_PTH but not F_MAP_UNMAP, then the device SHOULD reject
% any PTH request and PTH flag in attach.

% When not using the legacy interface, if the driver doesn't negotiate
% F_MAP_UNMAP, then the device may disable it an reject any request.

\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits / Legacy Interface: Feature bits}

When using the legacy interface, transitional devices MUST support guests
which do not negotiate any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP or
VIRTIO_IOMMU_F_PROBE features, and MUST behave as if the feature was
negotiated.

%\subsection{Feature Bits Requirements}\label{sec:Device Types / IOMMU Device / Feature bits requirements}

\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout}

The \field{page_size_mask} field is always present. Availability of the
others depend on various feature bits as indicated above.

\begin{lstlisting}
struct virtio_iommu_config {
	u64 page_size_mask;
	struct virtio_iommu_range {
		u64 start;
		u64 end;
	} input_range;
	u8 ioasid_bits;
	u8 padding[3];
	u32 probe_size;
};
\end{lstlisting}

\drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}

The driver MUST NOT write to device configuration fields.

\devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout}

The device SHOULD set \field{padding} to zero.

The device MUST set at least one bit in \field{page_size_mask}, describing
the page granularity. The device MAY set more than one bit in
\field{page_size_mask}.

\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout / Legacy Interface: Device configuration layout}
When using the legacy interface, transitional devices and drivers
MUST format the fields in struct virtio_iommu_config
according to the native endian of the guest rather than
(necessarily when not using the legacy interface) little-endian.


\subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization}

When the device is reset, endpoints are not attached to any address space.
If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
access guest-physical addresses ("bypass mode"). If the feature is not
negotiated, then any memory access from endpoints will fault. Upon
attaching an endpoint in bypass mode to a new address space, any memory
access from the endpoint will fault, since the address space does not
contain any mapping.

The driver chooses operating mode depending on its capabilities. In this
revision of the virtio-iommu specification, the only supported mode is
VIRTIO_IOMMU_F_MAP_UNMAP.

\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}

The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.

If the VIRTIO_IOMMU_F_PROBE feature is offered, the driver SHOULD send a
VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
endpoint to an address space.

\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization}

If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
device SHOULD NOT let endpoints access the guest-physical address space.
However, the device MAY let endpoints access memory regions negotiated
with VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (see
\ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}).

\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations}

Driver send requests on the request virtqueue, notifies the device and
waits for the device to return the request with a status in the used ring.
All requests are split in two parts: one device-readable, one device-
writable. Each request is therefore described with at least two
descriptors, as illustrated below.

\begin{figure}[htb]
  \centering
  \includegraphics[width=0.7\textwidth]{img/request-wrapping.png}
  \caption{Anatomy of a virtio-iommu request}
\end{figure}

\begin{lstlisting}
struct virtio_iommu_req_head {
	u8	type;
	u8	reserved[3];
};

struct virtio_iommu_req_tail {
	u8	status;
	u8	reserved[3];
};
\end{lstlisting}

\rfc{% TODO
There is a problem with framing using multiple chains... If a request
isn't recognized by the device and is scattered across multiple descriptor
chains, how would the device know where this request ends and where the
next one begins? It assumes that a transition from WO descriptors to RO
descriptors is a new request, ok. But if this unknown request is from a
future extension that uses interleaved RO, WO, RO, WO descriptors for a
single request, then the device is doomed.\\
The extension will have to force that third buffer to start with 0 so our
device, that thinks it's a new request, doesn't recognize the request type
and waits for the next one.\\
Alternatively, we could add a 16-bit 'size' field into
virtio_iommu_req_head. We'd loose some space for future flags, and force
requests to be at most 64k, or 256k if we drop size bits [1:0]. (head
couldn't be extended in the future, with a field to describe a bigger
size, because our legacy device here would ignore it.) We'd have to resort
to multiple "container" requests transporting a single big one.\\
Personally I think we can let future extensions deal with interleaved
RO/WO/RO descriptors, but do you think a 'size' field would be better?
}

Type may be one of:

\begin{lstlisting}
#define VIRTIO_IOMMU_T_ATTACH			1
#define VIRTIO_IOMMU_T_DETACH			2
#define VIRTIO_IOMMU_T_MAP			3
#define VIRTIO_IOMMU_T_UNMAP			4
#define VIRTIO_IOMMU_T_PROBE			5
\end{lstlisting}

A few general-purpose status codes are defined here. Unless explicitly
described in a \textbf{Requirements} section, these values are hints to
make troubleshooting easier.

When the device fails to parse a request, for instance if a request seems
too small for its type and the device cannot find the tail, then it will
be unable to set \field{status}. In that case, it should return the
buffers without writing in them.

\begin{lstlisting}
/* All good! Carry on. */
#define VIRTIO_IOMMU_S_OK			0
/* Virtio communication error */
#define VIRTIO_IOMMU_S_IOERR			1
/* Unsupported request */
#define VIRTIO_IOMMU_S_UNSUPP			2
/* Internal device error */
#define VIRTIO_IOMMU_S_DEVERR			3
/* Invalid parameters */
#define VIRTIO_IOMMU_S_INVAL			4
/* Out-of-range parameters */
#define VIRTIO_IOMMU_S_RANGE			5
/* Entry not found */
#define VIRTIO_IOMMU_S_NOENT			6
/* Bad address */
#define VIRTIO_IOMMU_S_FAULT			7
\end{lstlisting}

Range limits of some request fields are described in the device
configuration:

\begin{itemize}
\item \field{page_size_mask} contains the bitmask of all page sizes that
  can be mapped. The least significant bit set defines the page
  granularity of IOMMU mappings. Other bits in the mask are hints
  describing page sizes that the IOMMU can merge into a single mapping
  (page blocks).

  The smallest page granularity supported by the IOMMU is one byte. It is
  legal for the driver to map one byte at a time if bit 0 of
  \field{page_size_mask} is set.

\item If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered,
  \field{ioasid_bits} contains the number of bits supported in an I/O
  Address Space ID, the identifier used in most requests. A value of 0 is
  valid, and means that a single address space is supported.

  If the feature is not negotiated, address space identifiers can use up
  to 32 bits.

\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
  \field{input_range} contains the virtual address range that the IOMMU is
  able to translate. Any mapping request to virtual addresses outside of
  this range will fail.

  If the feature is not negotiated, virtual mappings span over the whole
  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
\end{itemize}

\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}

The driver SHOULD set reserved fields of the head and the tail of a
request to zero.

When a device returns a complete request in the used queue without having
written to it, the driver SHOULD interpret it as a failure from the device
to parse the request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the driver SHOULD
NOT send requests with \field{virt_addr} less than
\field{input_range.start} or greater than \field{input_range.end}.

If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered, the driver SHOULD
NOT send requests with \field{address_space} greater than the size
described by \field{ioasid_bits}.

% We mandate truncation to allow a future extension X.Y that would store
% information in addresses and address space IDs.
%
% If device is 0.2 and driver is X.Y, then device ignores ext. bits. But
% if device is X.Y and device is 0.2, then driver *might* set ext. bits to
% garbage. But this extension would be negotiated with a feature bit
% anyway. If it's not, then device must assume that driver is 0.2 and must
% keep truncating the fields.

\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}

The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
didn't succeed. \footnote{%
For IMPLEMENTATION DEFINED values of 'succeed'... For example,
virtio_iommu_req_detach.reserved is allowed to be non-zero. If it is
non-zero, the device may consider it to be a failure and abort the
request. Or it may go on with the detach and return OK.}

If a request \field{type} is not recognized, the device SHOULD return the
buffers on the used ring and set the \field{len} field of the used element
to zero.

The device MUST ignore reserved fields of the head and the tail of a
request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST
truncate the range described by \field{virt_addr} and \field{size} in
requests to fit in the range described by \field{input_range}.

If the VIRTIO_IOMMU_F_IOASID_BITS is offered, the device MUST ignore bits
above \field{ioasid_bits} in field \field{address_space} of requests.

\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}

\begin{lstlisting}
struct virtio_iommu_req_attach {
	le32	address_space;
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Attach an endpoint to an address space. \field{address_space} is an
identifier unique to the virtio-iommu device. If the address space doesn't
exist in the device, it is created. \field{device} is an endpoint
identifier unique to the virtio-iommu device. The host communicates unique
device IDs to the guest using methods outside the scope of this
specification, but the following rules apply:

\begin{itemize}
\item The device ID is unique from the virtio-iommu point of view. Multiple
  endpoints whose DMA transactions are not translated by the same
  virtio-iommu may have the same device ID. Endpoints whose DMA
  transactions may be translated by the same virtio-iommu must have
  different device IDs.

\item Sometimes the host cannot completely isolate two endpoints from each
  others. For example on a legacy PCI bus, endpoints can snoop DMA
  transactions from their neighbours. In this case, the host must
  communicate to the guest that it cannot isolate these endpoints from
  each others, or that the physical IOMMU cannot distinguish transactions
  coming from these endpoints. The method used to communicate this is
  outside the scope of this specification.
\end{itemize}

Multiple endpoints may be added to the same address space. An endpoint
cannot be attached to multiple address spaces in VIRTIO_IOMMU_F_MAP_UNMAP
mode.

\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}

The driver SHOULD set \field{reserved} to zero.

The driver SHOULD ensure that endpoints that cannot be isolated by the
host are attached to the same address space.

\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request}

If the \field{reserved} field of an ATTACH request is not zero, the device
SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
NOT attach the endpoint to the address space. \footnote{The device should
validate input of ATTACH requests in case the driver attempts to attach in
a mode that is unimplemented by the device, and would be incompatible with
the modes implemented by the device.}

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If another endpoint is already attached to the address space identified by
\field{address_space}, then the device MAY attach the endpoint identified
by \field{device} to the address space. If it cannot do so, the device
MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.

If the endpoint identified by \field{device} is already attached to
another address space, then the device SHOULD first detach it from that
address space and attach it to the one identified by
\field{address_space}. In that case the device behaves as if the driver
issued a DETACH request with this \field{device}, followed by the ATTACH
request. If the device cannot do so, it MUST set the request
\field{status} to VIRTIO_IOMMU_S_UNSUPP.

\subsubsection{DETACH request}

\begin{lstlisting}
struct virtio_iommu_req_detach {
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Detach an endpoint from its address space. When this request completes,
the endpoint cannot access any mapping from that address space anymore.

After all endpoints have been successfully detached from an address space,
it ceases to exist and its ID can be reused by the driver for another
address space.

\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}

The driver SHOULD set \field{reserved} to zero.

\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request}

If the \field{reserved} field of an DETACH request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the DETACH operation.
% If it returns OK, the device SHOULD go on with the detach, as required
% by the VIRTIO_IOMMU_S_OK rule.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the endpoint identified by \field{device} wasn't attached to any
address space, then the device MAY set the request \field{status} to
VIRTIO_IOMMU_S_INVAL.

The device MUST ensure that after being detached from an address space,
the endpoint cannot access any mapping from that address space.

\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request}

\begin{lstlisting}
struct virtio_iommu_req_map {
	le32	address_space;
	le64	phys_addr;
	le64	virt_addr;
	le64	size;
	le32	flags;
};

/* Flags are: */
#define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)
#define VIRTIO_IOMMU_MAP_F_EXEC		(1 << 2)
\end{lstlisting}

Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses of the same size. After the request
succeeds, all endpoints attached to this address space can access memory
in the range $[phys\_addr; phys\_addr + size[$. For example, if an endpoint
accesses address $VA \in [virt\_addr; virt\_addr + size[$, the device (or the
physical IOMMU) translates the address: $PA = VA - virt\_addr +
phys\_addr$. If the access parameters are compatible with \field{flags}
(for instance, the access is write and \field{flags} are
VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
the access to reach $PA$.

The range defined by (\field{virt_addr}, \field{size}) must be within the
limits specified by \field{input_range}. The range defined by
(\field{phys_addr}, \field{size}) must be within the guest-physical
address space. This includes upper and lower limits, as well as any
carving of guest-physical addresses for use by the host (for instance MSI
doorbells). Guest physical boundaries are set by the host using a firmware
mechanism outside the scope of this specification.

\begin{note}
This format prevents from creating the identity mapping in a single
request \texttt{[0x0; 0xfff....fff] $\rightarrow$ [0x0; 0xfff...fff]},
since it would result in a size of zero. Hopefully allowing
VIRTIO_IOMMU_F_BYPASS eliminates the need for issuing such request. It
would also be unlikely to conform to the physical range restrictions
from the previous paragraph.
\end{note}

\begin{note}
On flags: it is unlikely that all possible combinations of flags will be
supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$ might
be invalid. We do not have a way to advertise supported and implicit (for
instance $W \rightarrow R$) flags or combination thereof for the moment,
you are free to send any suggestions for describing this. Please keep in
mind that we might soon want to add more flags, such as privileged,
device, transient, shared, etc. (whatever these would mean).
\end{note}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}

The driver SHOULD set undefined \field{flags} bits to zero.

\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}

If \field{virt_addr}, \field{phys_addr} or \field{size} is not aligned on
the page granularity, the device SHOULD set the request \field{status} to
VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.

If the device doesn't recognize a \field{flags} bit, it SHOULD set the
request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
SHOULD NOT create the mapping. \footnote{Validating the input is important
here, because the driver might be attempting to map with special flags
that the device doesn't recognize. Creating the mapping with incompatible
flags may introduce a security hazard.}

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request}

\begin{lstlisting}
struct virtio_iommu_req_unmap {
	le32	address_space;
	le64	virt_addr;
	le64	size;
	le32	reserved;
};
\end{lstlisting}

Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
a mapping as a virtual region created with a single MAP request. All
mappings covered by the range $[virt\_addr; virt\_addr + size [$ are
removed.

The semantics of unmapping are specified below, and illustrated with the
following requests, assuming each example sequence starts with a blank
address space. We define two pseudocode functions \texttt{map(virt\_addr,
size) -> mapping} and \texttt{unmap(virt\_addr, size)}.

\begin{lstlisting}
(1) unmap(addr=0, size=5)        -> succeeds, doesn't unmap anything

(2) a = map(addr=0, size=10);
    unmap(0, 10)                 -> succeeds, unmaps a

(3) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 10)                 -> succeeds, unmaps a and b

(4) a = map(0, 10);
    unmap(0, 5)                  -> faults, doesn't unmap anything

(5) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 5)                  -> succeeds, unmaps a

(6) a = map(0, 5);
    unmap(0, 10)                 -> succeeds, unmaps a

(7) a = map(0, 5);
    b = map(10, 5);
    unmap(0, 15)                 -> succeeds, unmaps a and b
\end{lstlisting}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}

The driver SHOULD set the \field{reserved} field to zero.

The range, defined by \field{virt_addr} and \field{size}, SHOULD cover one
or more contiguous mappings created with MAP requests. The range MAY spill
over unmapped virtual addresses.

The first address of a range SHOULD either be the first address of a
mapping or be outside any mapping. The last address of a range SHOULD
either be the last address of a mapping or be outside any mapping.

\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request}

If the \field{reserved} field of an UNMAP request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the UNMAP operation.
% If it returns OK, the device SHOULD go on with the unmap, as required by
% the VIRTIO_IOMMU_S_OK rule.

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

If a mapping affected by the range is not covered in its entirety by the
range (the UNMAP request would split the mapping), then the device SHOULD
set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
remove any mapping.

If part of the range or the full range is not covered by an existing
mapping, then the device SHOULD remove all mappings affected by the range
and set the request \field{status} to VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request}

If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
device manages. This probe is performed before attaching the endpoint to
an address space.

\begin{lstlisting}
struct virtio_iommu_req_probe {
	/* Device-readable */
	le32	device;
	le32	flags;
	u8	reserved[60];

	/* Device-writable when not ACK */
	u8	properties[];
};

/* Flags are: */
#define VIRTIO_IOMMU_PROBE_F_ACK	(1 << 0)
\end{lstlisting}

\begin{description}
\item[\field{device}] has the same meaning as in ATTACH and DETACH
  requests.

\item[\field{flags}] contain additional information about the request.
  The VIRTIO_IOMMU_PROBE_F_ACK flag changes the descriptor chain layout:
  when ACK is clear, the \field{properties} field is device-writable;
  when it is set, the \field{properties} field is device-readable.

\item[\field{reserved}] is used as padding, so that future extensions can
  add fields to the device-readable part.

\item[\field{properties}] contains a list of properties of endpoint
  \field{device}, filled by the device. This field is exactly
  \field{probe_size} bytes. Each property is described with a type, four
    flag bits, a length, and a value:
\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK	0xfff
#define VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK	(1 << 12)

struct virtio_iommu_probe_property {
	le16	type;
	le16	length;
	u8	value[];
};
\end{lstlisting}

\end{description}

The driver allocates a buffer of adequate size for the probe request,
writes \field{device} and adds it to the request queue. The device fills
the \field{properties} field with a list of properties for this endpoint.

The driver parses the first property by reading \field{type}, then
\field{length}. If the driver recognizes \field{type}, it reads and
handles \field{value}. The driver then reads the next property, that is
located $(\field{length} + 4)$ bytes after the beginning of the first one,
and so on. The driver parses all properties until it reaches a NONE
property or the end of \field{properties}.

The upper nibble of property \field{type} is reserved for flags.
Therefore only 4096 types are available. The actual type of a property is
extracted like this:

\begin{lstlisting}
u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK;
\end{lstlisting}

If a property is correctly understood by the driver, then it sets the ACK
bit in \field{type}:

\begin{lstlisting}
property.type |= cpu_to_le16(VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK);
\end{lstlisting}

Then, to signal to the device which properties are understood, the device
sends the probe again with the VIRTIO_IOMMU_PROBE_F_ACK flag. In all
properties understood and accepted by the driver, \field{type} has the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit set. The other properties are left
as is.

This second phase of the probe request allows the device to ensure that
all properties crucial for good operations are recognized and handled by
the driver. This is analogous to the initial feature negotiation of virtio
devices: an endpoint property is \emph{offered} by the device to the
driver during the first PROBE, and it is \emph{negotiated} after the
driver acknowledges it during the second PROBE.

Available property types are described in section
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
When attaching multiple devices to the same address space, their
properties are combined. \emph{Combination Rules} are given for each
property, and describe the rules to apply when combining properties
obtained during probe.

\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}

The size of \field{properties} MUST be \field{probe_size} bytes.

The driver SHOULD set undefined \field{flags} to zero.

The driver SHOULD set \field{reserved} to zero.

If the driver doesn't recognize the \field{type} of a property, it SHOULD
ignore the property and continue parsing the list.

The driver SHOULD NOT deduce the property length from \field{type}.

If the driver recognizes a property \field{type} and is able to
handle{\footnotemark} the property, then the driver SHOULD set the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit of that property.

\footnotetext{A driver's ability to handle a property depends on the
property type. Without a specific definition of the ACK requirements for a
given property type, it simply means that the driver read all fields of
that property.}

The driver SHOULD resend the PROBE request with the
VIRTIO_IOMMU_PROBE_F_ACK bit set after parsing and updating the
\field{properties} list. Depending on the properties encountered in the
list, the driver MAY modify some of their fields between the first and
second probe, but it SHOULD NOT modify the \field{length} field or bits
[11:0] of field \field{type}.

\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request}

If an undefined bit is set in \field{flags}, the device MAY set the
request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the \field{reserved} field of a PROBE request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
the buffers on the used ring and set the \field{len} field of the used
element to zero.

The device SHOULD set bits [15:13] of property \field{type} to zero.

The device MUST write the size of \field{value}, in bytes, into
\field{length}.

When two properties follow each others, the device MUST put the second
property exactly $(\field{length} + 4)$ bytes after the beginning of the
first one.

If the device doesn't fill all \field{probe_size} bytes with properties,
it SHOULD terminate the list with a property of type NONE and size 0. The
device MAY fill the remaining bytes of \field{properties}, if any, with
zeroes. If there isn't enough space remaining in \field{properties} to
terminate the list with a complete NONE property (4 bytes), then the
device SHOULD fill the remaining bytes with zeroes.

If the PROBE request has VIRTIO_IOMMU_PROBE_F_ACK bit set, the device MAY
ignore the request and set the request \field{status} to
VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties}

\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_T_NONE		0
#define VIRTIO_IOMMU_PROBE_T_RESV_MEM		2
\end{lstlisting}

\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE}

Marks the end of the property list. This property doesn't have any value,
and should have \field{length} 0.

\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The RESV_MEM property describes a chunk of reserved virtual memory. It may
be used by the device to describe virtual address ranges that shouldn't be
allocated by the driver, or that are special.

\begin{lstlisting}
struct virtio_iommu_probe_resv_mem {
	u8	subtype;
	u8	reserved[3];
	le64	addr;
	le64	size;
	le32	flags;
};
\end{lstlisting}

Fields \field{addr} and \field{size} describe the range of reserved
addresses. \field{subtype} may be one of:

\begin{description}
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT (0)]
    Accesses to this region are aborted. This subtype does not accept any
    flag.
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (1)]
    Accesses to this region behave as if the IOMMU was bypassed, and reach
    the bus upstream of the IOMMU untranslated.

    The following \field{flags} are defined for BYPASS regions:
    \begin{description}
      \item[VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI (1)]
        Provides a hint to the guest that this is a doorbell for Message
        Signaled Interrupts.

        If the device doesn't provide such a region, then MSIs are normal
        write accesses from the IOMMU point of view, and arbitrary virtual
        addresses should be allocated by the driver to map MSI doorbells.
        Otherwise, the guest should use the guest-physical doorbell
        address when programming MSIs for this endpoint.
    \end{description}

  %\item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_IDENTITY (2)]
  %  This region should be identity-mapped by the guest. TODO: is this
  %  useful for anyone?
\end{description}

\propcombination{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

Multiple overlapping RESV_MEM properties are merged together. Difference
in subtype on the intersecting range doesn't make a difference from the
driver point of view.

\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The driver SHOULD NOT map any virtual address described by a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT or
VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS property.

% An old driver that doesn't find or understand this property will
% allocate and map virtual addresses. We really can't do anything about
% that. We're not introducing a regression, MSIs never worked for x86
% before we introduced the F_MSI flag.

The driver SHOULD ignore \field{reserved}.

For a given \field{subtype}, the driver SHOULD ignore undefined
\field{flags} bits.

The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
was VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT.

\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}

The device SHOULD set \field{reserved} to zero.

For a given \field{subtype}, the device SHOULD set undefined \field{flags}
bits to zero.

The device MAY abort any transaction targeting a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT region.

If an endpoint is attached to an address space, the device SHOULD leave
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
regions pass through untranslated. In other words, the device SHOULD
handle such a region as if it was identity-mapped (virtual address equal
to physical address). If the endpoint is not attached to any address
space, then the device MAY abort the transaction.

The device MAY abort any transaction that isn't a write access and that
targets a VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS region with flag
VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI.



---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC] virtio-iommu v0.4 - Implementation notes
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-04 18:19   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

The following is roughly the content of topology.tex and MSI.tex

---
\section{Implementation notes}\label{sec:viommu}

\subsection{Virtual system topology}\label{sec:viommu / Virtual topology}

\subsubsection{Example virtual topology}\label{sec:viommu / Virtual topology / Example}

\begin{figure}[htb]
  \centering
  \includegraphics[width=\textwidth]{img/virtual-topology.png}
  \caption{An example IOMMU topology}
  \label{fig:viommu / Virtual topology / Topology}
\end{figure}

Diagram~\ref{fig:viommu / Virtual topology / Topology} shows an example
system topology centered around an IOMMU. On the left, the IOMMU manages
traffic from two PCI root complexes. On the right, the IOMMU manages
traffic from three platform devices (or "integrated devices").

Within a PCI domain, devices are identified by a Requester ID. It is a
16-bit identifier also called Bus/Device/Function (BDF). In a BDF, Bus is
8 bits, Device is 5 bits, and Function is 3 bits.

The bottom PCI domain has four endpoints connected to the root complex via
two bridges. The first endpoint is identified by BDF 01:04.0. On the other
bus, the first endpoint is identified by BDF 02:00.0, and the two other
endpoints are two functions of the same device, identified by BDFs 02:01.0
and 02:01.1. The bridges and the root complex may also issue transactions
with BDFs 00:00.0, 00:01.0 and 00:02.0.

In order for the IOMMU to differentiate devices in multiple PCI domains,
the root bridge expands the BDF with a domain ID. In example
\ref{fig:viommu / Virtual topology / Topology},
the PCI domain on top gets ID 0 and the one on the bottom gets ID 1.
Therefore when reaching the IOMMU, a transaction coming from endpoint
01:04.0 (= 0x0120) is identified by Device ID 0x10120.

We define here "platform" devices as endpoints that are on the system bus,
as opposed to behind a PCI host bridge. Unlike PCI devices, platform
devices do not have a standardized identifier scheme to be used with the
IOMMU. Their Device IDs are chosen arbitrarily during system integration
in such way that they don't overlap PCI domains or each others.

\subsubsection{Firmware description}\label{sec:viommu / Virtual topology / Firmware description}

The host describes the relation between IOMMU and devices to the guest
using either device-tree or ACPI. Topology description is outside the
scope of virtio-iommu, because the virtio-iommu does not and should not
need to know about vendor-specific buses. The virtual IOMMU identifies
each virtual endpoint with an abstract 32-bit ID, that is called "Device
ID" in this document\footnote{Other IOMMU architectures use different
names, such as "stream ID" on ARM SMMU or "source ID" on Intel VT-d}.
Device IDs are not necessarily unique system-wide, but they should not
overlap within a single virtio-iommu. Device IDs of physical endpoints do
not need to match IDs seen by the physical IOMMU.

We strongly advise to implement the virtio-iommu using virtio-mmio
transport. Nothing prevents an implementation to use virtio-pci instead,
but existing firmware interfaces do not easily allow to describe an IOMMU
$\leftrightarrow$ master relations between PCI endpoints. Device models in
Operating Systems might not be designed to support such complicated
system.

Device-tree offers a way to describe the IOMMU topology for PCI and
platform devices. Here's an excerpt of the device-tree describing examples
\ref{fig:viommu / Virtual topology / Topology}.

\begin{lstlisting}
/* The virtual IOMMU is described with a virtio-mmio node */
viommu: virtio@9050000 {
	compatible = "virtio,mmio";
	reg = <0x09050000 0x200>;
	dma-coherent;
	interrupts = <0x0 0x5 0x1>;

	#iommu-cells = <1>
};

/* PCI domain 0 */
pcie@3eff0000 {
	...
	/* Identity map */
	iommu-map = <0x0 &viommu 0x0 0x10000>;
};

/* PCI domain 1 */
pcie@3f000000 {
	...
	/* Linear map: deviceID = RID + 0x10000 */
	iommu-map = <0x0 &viommu 0x10000 0x10000>;
};

someplatformdevice@a000000 {
	...
	iommus = <&viommu 0x20000>;
};
\end{lstlisting}

For more details, please refer to \hyperref[intro:IOMMU DT Bindings]{[IOMMU DT]}.

In ACPI, the plan would be to add a new node type to the IO Remapping
Table specification \hyperref[intro:ACPI IORT]{[ACPI IORT]}, that provides
a mechanism similar to DT for describing IOMMU topology.

The OS would parse the IORT table to build a map of ID relations between
IOMMU and devices. ID Array is used to find correspondence between IOMMU
IDs and PCI or platform devices. Later on, the virtio-iommu driver finds
the associated LNRO0005 descriptor via the "Device object name" field, and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.

The following table shows the possible\protect\footnotemark\ format for a
paravirtualized IOMMU IORT node.

\footnotetext{This table IS NOT authoritative, only a suggestion.
  Such a node would be described in \hyperref[intro:ACPI IORT]{[ACPI IORT]}}.

\begin{center}
\begin{tabular}{| l | l | l | p{.4\textwidth} |}
\hline
\textbf{Field} & \textbf{Length} & \textbf{Offset} & \textbf{Description} \\
\hline
Type                  & 1     & 0     & 5: Paravirtualized IOMMU \\
\hline
Length                & 2     & 1     & The length of the node. \\
\hline
Revision              & 1     & 3     & 0 \\
\hline
Reserved              & 4     & 4     & Must be zero. \\
\hline
Number of ID mapping  & 4     & 8     & \\
\hline
Reference to ID Array & 4     & 12    &
  Offset from the start of the ID Array IORT node to the start of its
  Array ID mappings.\\
\hline
Model                 & 4     & 16    & 0: virtio-iommu \\
\hline
Device object name    &       & 20    &
  ASCII Null terminated string with the full path to the entry in the
  namespace for this IOMMU. \\
\hline
Padding & & & To keep 32-bit alignment and leave space for future models. \\
\hline
Array of ID mappings  & 20xN  &       & ID Array. \\
\hline
\end{tabular}
\end{center}

---
\subsection{Message Signaled Interrupts}\label{sec:viommu / MSI}

Some buses, such as PCI, implement Message Signaled Interrupts. Instead of
requesting an interrupt via a wire that runs from the endpoint to the irqchip,
the endpoint can request interrupts by performing a memory write to a specific
register (the "doorbell").

By combining the data written to the doorbell, the address itself, and the
originator of the write, the IRQ chip deduces the destination interrupt
number and destination processing units. Additional devices between the
endpoint and the IRQ chip may translate the doorbell address, the IRQ
number and verify that the endpoint is allowed to send this interrupt.

Different platforms implement IRQ remapping and routing in different ways.
This section describes three ways of dealing with Message Signaled
Interrupts in virtio-iommu devices and drivers.

In simplest systems, the endpoint writes the plain interrupt number to the
doorbell, and the IRQ chip signals the interruption to destination CPUs
programmed by software. Section \ref{sec:viommu / MSI / Address bypass}
describes how to implement a simple system with virtio-iommu. Section
\ref{sec:viommu / MSI / Address translation} describes the added complexity
(from the host point of view) of translating the IRQ chip doorbell.

More complex systems add a level of indirection in the MSI message. The address
or data contains an index into a remapping table, that describes interrupt
delivery in details and is programmed by software either into the IRQ chip or
the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use
the remapping feature of virtio-iommu.

\subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-noremap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

Bypassing translation for MSIs is the simplest implementation from the host
perspective. The virtio-iommu device has a special IOVA window that it does not
translate. Any access from devices to that region is forwarded upstream of the
IOMMU without being translated or even checked.

The IRQ chip may or may not have an IRQ remapping component. It may be as
simple as generating the interrupt number described in data, without checking
if the device was allowed to send that interrupt. If there is another
component performing the isolation, one might consider translating the
doorbell address superfluous.

With virtio-iommu, the device can advertise the doorbell address as
untranslated by using the PROBE request with a reserved region (see
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}).

For example, if the virtual platform has an IRQ remapping module with a
doorbell in the physical address range 0xfee00000-0xfeefffff, then the
device can present the following property to the driver:

\begin{lstlisting}
struct __attribute__((packed)) {
	struct virtio_iommu_probe_property	head;
	struct virtio_iommu_probe_resv_mem	mem;
} doorbell = {
	.head = {
		.type		= VIRTIO_IOMMU_PROBE_T_RESV_MEM,
		.length		= sizeof(doorbell.mem),
	},
	.mem = {
		.subtype	= VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS,
		.flags		= VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI,
		.addr		= 0xfee00000,
		.size		= 0x00100000,
	},
};
\end{lstlisting}

\subsubsection{Address translation}\label{sec:viommu / MSI / Address translation}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-remap.png}
  \caption{MSI remapping with address translation}
\end{figure}

On some systems (e.g. ARM-based platforms) the IOMMU does not have a special
MSI window, and MSIs are treated like any other memory write. The MSI address
therefore has to be translated by the IOMMU before reaching the IRQ chip.

Address translation may be used as a rudimentary form of MSI isolation,
but multiple endpoints will typically access the same doorbell. Address
translation can only forbid an endpoint from sending interrupts. If it is
allowed to send MSIs, the endpoint can easily spoof another endpoint by
sending interrupts that were not assigned to it.

>From the virtio-iommu point of view, this is the simplest to implement, because
there is no special address range. The whole address space is treated the same
by the virtio-iommu device.

However, this mode of operations may add significant complexity in the host
implementation.


\subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping}

Some IOMMUs (e.g. Intel and AMD IOMMUs) are able to remap IRQs themselves. 

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-irq-remap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

This version of virtio-iommu doesn't support IRQ remapping.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC] virtio-iommu v0.4 - Implementation notes
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (2 preceding siblings ...)
  (?)
@ 2017-08-04 18:19 ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

The following is roughly the content of topology.tex and MSI.tex

---
\section{Implementation notes}\label{sec:viommu}

\subsection{Virtual system topology}\label{sec:viommu / Virtual topology}

\subsubsection{Example virtual topology}\label{sec:viommu / Virtual topology / Example}

\begin{figure}[htb]
  \centering
  \includegraphics[width=\textwidth]{img/virtual-topology.png}
  \caption{An example IOMMU topology}
  \label{fig:viommu / Virtual topology / Topology}
\end{figure}

Diagram~\ref{fig:viommu / Virtual topology / Topology} shows an example
system topology centered around an IOMMU. On the left, the IOMMU manages
traffic from two PCI root complexes. On the right, the IOMMU manages
traffic from three platform devices (or "integrated devices").

Within a PCI domain, devices are identified by a Requester ID. It is a
16-bit identifier also called Bus/Device/Function (BDF). In a BDF, Bus is
8 bits, Device is 5 bits, and Function is 3 bits.

The bottom PCI domain has four endpoints connected to the root complex via
two bridges. The first endpoint is identified by BDF 01:04.0. On the other
bus, the first endpoint is identified by BDF 02:00.0, and the two other
endpoints are two functions of the same device, identified by BDFs 02:01.0
and 02:01.1. The bridges and the root complex may also issue transactions
with BDFs 00:00.0, 00:01.0 and 00:02.0.

In order for the IOMMU to differentiate devices in multiple PCI domains,
the root bridge expands the BDF with a domain ID. In example
\ref{fig:viommu / Virtual topology / Topology},
the PCI domain on top gets ID 0 and the one on the bottom gets ID 1.
Therefore when reaching the IOMMU, a transaction coming from endpoint
01:04.0 (= 0x0120) is identified by Device ID 0x10120.

We define here "platform" devices as endpoints that are on the system bus,
as opposed to behind a PCI host bridge. Unlike PCI devices, platform
devices do not have a standardized identifier scheme to be used with the
IOMMU. Their Device IDs are chosen arbitrarily during system integration
in such way that they don't overlap PCI domains or each others.

\subsubsection{Firmware description}\label{sec:viommu / Virtual topology / Firmware description}

The host describes the relation between IOMMU and devices to the guest
using either device-tree or ACPI. Topology description is outside the
scope of virtio-iommu, because the virtio-iommu does not and should not
need to know about vendor-specific buses. The virtual IOMMU identifies
each virtual endpoint with an abstract 32-bit ID, that is called "Device
ID" in this document\footnote{Other IOMMU architectures use different
names, such as "stream ID" on ARM SMMU or "source ID" on Intel VT-d}.
Device IDs are not necessarily unique system-wide, but they should not
overlap within a single virtio-iommu. Device IDs of physical endpoints do
not need to match IDs seen by the physical IOMMU.

We strongly advise to implement the virtio-iommu using virtio-mmio
transport. Nothing prevents an implementation to use virtio-pci instead,
but existing firmware interfaces do not easily allow to describe an IOMMU
$\leftrightarrow$ master relations between PCI endpoints. Device models in
Operating Systems might not be designed to support such complicated
system.

Device-tree offers a way to describe the IOMMU topology for PCI and
platform devices. Here's an excerpt of the device-tree describing examples
\ref{fig:viommu / Virtual topology / Topology}.

\begin{lstlisting}
/* The virtual IOMMU is described with a virtio-mmio node */
viommu: virtio@9050000 {
	compatible = "virtio,mmio";
	reg = <0x09050000 0x200>;
	dma-coherent;
	interrupts = <0x0 0x5 0x1>;

	#iommu-cells = <1>
};

/* PCI domain 0 */
pcie@3eff0000 {
	...
	/* Identity map */
	iommu-map = <0x0 &viommu 0x0 0x10000>;
};

/* PCI domain 1 */
pcie@3f000000 {
	...
	/* Linear map: deviceID = RID + 0x10000 */
	iommu-map = <0x0 &viommu 0x10000 0x10000>;
};

someplatformdevice@a000000 {
	...
	iommus = <&viommu 0x20000>;
};
\end{lstlisting}

For more details, please refer to \hyperref[intro:IOMMU DT Bindings]{[IOMMU DT]}.

In ACPI, the plan would be to add a new node type to the IO Remapping
Table specification \hyperref[intro:ACPI IORT]{[ACPI IORT]}, that provides
a mechanism similar to DT for describing IOMMU topology.

The OS would parse the IORT table to build a map of ID relations between
IOMMU and devices. ID Array is used to find correspondence between IOMMU
IDs and PCI or platform devices. Later on, the virtio-iommu driver finds
the associated LNRO0005 descriptor via the "Device object name" field, and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.

The following table shows the possible\protect\footnotemark\ format for a
paravirtualized IOMMU IORT node.

\footnotetext{This table IS NOT authoritative, only a suggestion.
  Such a node would be described in \hyperref[intro:ACPI IORT]{[ACPI IORT]}}.

\begin{center}
\begin{tabular}{| l | l | l | p{.4\textwidth} |}
\hline
\textbf{Field} & \textbf{Length} & \textbf{Offset} & \textbf{Description} \\
\hline
Type                  & 1     & 0     & 5: Paravirtualized IOMMU \\
\hline
Length                & 2     & 1     & The length of the node. \\
\hline
Revision              & 1     & 3     & 0 \\
\hline
Reserved              & 4     & 4     & Must be zero. \\
\hline
Number of ID mapping  & 4     & 8     & \\
\hline
Reference to ID Array & 4     & 12    &
  Offset from the start of the ID Array IORT node to the start of its
  Array ID mappings.\\
\hline
Model                 & 4     & 16    & 0: virtio-iommu \\
\hline
Device object name    &       & 20    &
  ASCII Null terminated string with the full path to the entry in the
  namespace for this IOMMU. \\
\hline
Padding & & & To keep 32-bit alignment and leave space for future models. \\
\hline
Array of ID mappings  & 20xN  &       & ID Array. \\
\hline
\end{tabular}
\end{center}

---
\subsection{Message Signaled Interrupts}\label{sec:viommu / MSI}

Some buses, such as PCI, implement Message Signaled Interrupts. Instead of
requesting an interrupt via a wire that runs from the endpoint to the irqchip,
the endpoint can request interrupts by performing a memory write to a specific
register (the "doorbell").

By combining the data written to the doorbell, the address itself, and the
originator of the write, the IRQ chip deduces the destination interrupt
number and destination processing units. Additional devices between the
endpoint and the IRQ chip may translate the doorbell address, the IRQ
number and verify that the endpoint is allowed to send this interrupt.

Different platforms implement IRQ remapping and routing in different ways.
This section describes three ways of dealing with Message Signaled
Interrupts in virtio-iommu devices and drivers.

In simplest systems, the endpoint writes the plain interrupt number to the
doorbell, and the IRQ chip signals the interruption to destination CPUs
programmed by software. Section \ref{sec:viommu / MSI / Address bypass}
describes how to implement a simple system with virtio-iommu. Section
\ref{sec:viommu / MSI / Address translation} describes the added complexity
(from the host point of view) of translating the IRQ chip doorbell.

More complex systems add a level of indirection in the MSI message. The address
or data contains an index into a remapping table, that describes interrupt
delivery in details and is programmed by software either into the IRQ chip or
the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use
the remapping feature of virtio-iommu.

\subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-noremap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

Bypassing translation for MSIs is the simplest implementation from the host
perspective. The virtio-iommu device has a special IOVA window that it does not
translate. Any access from devices to that region is forwarded upstream of the
IOMMU without being translated or even checked.

The IRQ chip may or may not have an IRQ remapping component. It may be as
simple as generating the interrupt number described in data, without checking
if the device was allowed to send that interrupt. If there is another
component performing the isolation, one might consider translating the
doorbell address superfluous.

With virtio-iommu, the device can advertise the doorbell address as
untranslated by using the PROBE request with a reserved region (see
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}).

For example, if the virtual platform has an IRQ remapping module with a
doorbell in the physical address range 0xfee00000-0xfeefffff, then the
device can present the following property to the driver:

\begin{lstlisting}
struct __attribute__((packed)) {
	struct virtio_iommu_probe_property	head;
	struct virtio_iommu_probe_resv_mem	mem;
} doorbell = {
	.head = {
		.type		= VIRTIO_IOMMU_PROBE_T_RESV_MEM,
		.length		= sizeof(doorbell.mem),
	},
	.mem = {
		.subtype	= VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS,
		.flags		= VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI,
		.addr		= 0xfee00000,
		.size		= 0x00100000,
	},
};
\end{lstlisting}

\subsubsection{Address translation}\label{sec:viommu / MSI / Address translation}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-remap.png}
  \caption{MSI remapping with address translation}
\end{figure}

On some systems (e.g. ARM-based platforms) the IOMMU does not have a special
MSI window, and MSIs are treated like any other memory write. The MSI address
therefore has to be translated by the IOMMU before reaching the IRQ chip.

Address translation may be used as a rudimentary form of MSI isolation,
but multiple endpoints will typically access the same doorbell. Address
translation can only forbid an endpoint from sending interrupts. If it is
allowed to send MSIs, the endpoint can easily spoof another endpoint by
sending interrupts that were not assigned to it.

From the virtio-iommu point of view, this is the simplest to implement, because
there is no special address range. The whole address space is treated the same
by the virtio-iommu device.

However, this mode of operations may add significant complexity in the host
implementation.


\subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping}

Some IOMMUs (e.g. Intel and AMD IOMMUs) are able to remap IRQs themselves. 

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-irq-remap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

This version of virtio-iommu doesn't support IRQ remapping.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] [RFC] virtio-iommu v0.4 - Implementation notes
@ 2017-08-04 18:19   ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

The following is roughly the content of topology.tex and MSI.tex

---
\section{Implementation notes}\label{sec:viommu}

\subsection{Virtual system topology}\label{sec:viommu / Virtual topology}

\subsubsection{Example virtual topology}\label{sec:viommu / Virtual topology / Example}

\begin{figure}[htb]
  \centering
  \includegraphics[width=\textwidth]{img/virtual-topology.png}
  \caption{An example IOMMU topology}
  \label{fig:viommu / Virtual topology / Topology}
\end{figure}

Diagram~\ref{fig:viommu / Virtual topology / Topology} shows an example
system topology centered around an IOMMU. On the left, the IOMMU manages
traffic from two PCI root complexes. On the right, the IOMMU manages
traffic from three platform devices (or "integrated devices").

Within a PCI domain, devices are identified by a Requester ID. It is a
16-bit identifier also called Bus/Device/Function (BDF). In a BDF, Bus is
8 bits, Device is 5 bits, and Function is 3 bits.

The bottom PCI domain has four endpoints connected to the root complex via
two bridges. The first endpoint is identified by BDF 01:04.0. On the other
bus, the first endpoint is identified by BDF 02:00.0, and the two other
endpoints are two functions of the same device, identified by BDFs 02:01.0
and 02:01.1. The bridges and the root complex may also issue transactions
with BDFs 00:00.0, 00:01.0 and 00:02.0.

In order for the IOMMU to differentiate devices in multiple PCI domains,
the root bridge expands the BDF with a domain ID. In example
\ref{fig:viommu / Virtual topology / Topology},
the PCI domain on top gets ID 0 and the one on the bottom gets ID 1.
Therefore when reaching the IOMMU, a transaction coming from endpoint
01:04.0 (= 0x0120) is identified by Device ID 0x10120.

We define here "platform" devices as endpoints that are on the system bus,
as opposed to behind a PCI host bridge. Unlike PCI devices, platform
devices do not have a standardized identifier scheme to be used with the
IOMMU. Their Device IDs are chosen arbitrarily during system integration
in such way that they don't overlap PCI domains or each others.

\subsubsection{Firmware description}\label{sec:viommu / Virtual topology / Firmware description}

The host describes the relation between IOMMU and devices to the guest
using either device-tree or ACPI. Topology description is outside the
scope of virtio-iommu, because the virtio-iommu does not and should not
need to know about vendor-specific buses. The virtual IOMMU identifies
each virtual endpoint with an abstract 32-bit ID, that is called "Device
ID" in this document\footnote{Other IOMMU architectures use different
names, such as "stream ID" on ARM SMMU or "source ID" on Intel VT-d}.
Device IDs are not necessarily unique system-wide, but they should not
overlap within a single virtio-iommu. Device IDs of physical endpoints do
not need to match IDs seen by the physical IOMMU.

We strongly advise to implement the virtio-iommu using virtio-mmio
transport. Nothing prevents an implementation to use virtio-pci instead,
but existing firmware interfaces do not easily allow to describe an IOMMU
$\leftrightarrow$ master relations between PCI endpoints. Device models in
Operating Systems might not be designed to support such complicated
system.

Device-tree offers a way to describe the IOMMU topology for PCI and
platform devices. Here's an excerpt of the device-tree describing examples
\ref{fig:viommu / Virtual topology / Topology}.

\begin{lstlisting}
/* The virtual IOMMU is described with a virtio-mmio node */
viommu: virtio@9050000 {
	compatible = "virtio,mmio";
	reg = <0x09050000 0x200>;
	dma-coherent;
	interrupts = <0x0 0x5 0x1>;

	#iommu-cells = <1>
};

/* PCI domain 0 */
pcie@3eff0000 {
	...
	/* Identity map */
	iommu-map = <0x0 &viommu 0x0 0x10000>;
};

/* PCI domain 1 */
pcie@3f000000 {
	...
	/* Linear map: deviceID = RID + 0x10000 */
	iommu-map = <0x0 &viommu 0x10000 0x10000>;
};

someplatformdevice@a000000 {
	...
	iommus = <&viommu 0x20000>;
};
\end{lstlisting}

For more details, please refer to \hyperref[intro:IOMMU DT Bindings]{[IOMMU DT]}.

In ACPI, the plan would be to add a new node type to the IO Remapping
Table specification \hyperref[intro:ACPI IORT]{[ACPI IORT]}, that provides
a mechanism similar to DT for describing IOMMU topology.

The OS would parse the IORT table to build a map of ID relations between
IOMMU and devices. ID Array is used to find correspondence between IOMMU
IDs and PCI or platform devices. Later on, the virtio-iommu driver finds
the associated LNRO0005 descriptor via the "Device object name" field, and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.

The following table shows the possible\protect\footnotemark\ format for a
paravirtualized IOMMU IORT node.

\footnotetext{This table IS NOT authoritative, only a suggestion.
  Such a node would be described in \hyperref[intro:ACPI IORT]{[ACPI IORT]}}.

\begin{center}
\begin{tabular}{| l | l | l | p{.4\textwidth} |}
\hline
\textbf{Field} & \textbf{Length} & \textbf{Offset} & \textbf{Description} \\
\hline
Type                  & 1     & 0     & 5: Paravirtualized IOMMU \\
\hline
Length                & 2     & 1     & The length of the node. \\
\hline
Revision              & 1     & 3     & 0 \\
\hline
Reserved              & 4     & 4     & Must be zero. \\
\hline
Number of ID mapping  & 4     & 8     & \\
\hline
Reference to ID Array & 4     & 12    &
  Offset from the start of the ID Array IORT node to the start of its
  Array ID mappings.\\
\hline
Model                 & 4     & 16    & 0: virtio-iommu \\
\hline
Device object name    &       & 20    &
  ASCII Null terminated string with the full path to the entry in the
  namespace for this IOMMU. \\
\hline
Padding & & & To keep 32-bit alignment and leave space for future models. \\
\hline
Array of ID mappings  & 20xN  &       & ID Array. \\
\hline
\end{tabular}
\end{center}

---
\subsection{Message Signaled Interrupts}\label{sec:viommu / MSI}

Some buses, such as PCI, implement Message Signaled Interrupts. Instead of
requesting an interrupt via a wire that runs from the endpoint to the irqchip,
the endpoint can request interrupts by performing a memory write to a specific
register (the "doorbell").

By combining the data written to the doorbell, the address itself, and the
originator of the write, the IRQ chip deduces the destination interrupt
number and destination processing units. Additional devices between the
endpoint and the IRQ chip may translate the doorbell address, the IRQ
number and verify that the endpoint is allowed to send this interrupt.

Different platforms implement IRQ remapping and routing in different ways.
This section describes three ways of dealing with Message Signaled
Interrupts in virtio-iommu devices and drivers.

In simplest systems, the endpoint writes the plain interrupt number to the
doorbell, and the IRQ chip signals the interruption to destination CPUs
programmed by software. Section \ref{sec:viommu / MSI / Address bypass}
describes how to implement a simple system with virtio-iommu. Section
\ref{sec:viommu / MSI / Address translation} describes the added complexity
(from the host point of view) of translating the IRQ chip doorbell.

More complex systems add a level of indirection in the MSI message. The address
or data contains an index into a remapping table, that describes interrupt
delivery in details and is programmed by software either into the IRQ chip or
the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use
the remapping feature of virtio-iommu.

\subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-noremap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

Bypassing translation for MSIs is the simplest implementation from the host
perspective. The virtio-iommu device has a special IOVA window that it does not
translate. Any access from devices to that region is forwarded upstream of the
IOMMU without being translated or even checked.

The IRQ chip may or may not have an IRQ remapping component. It may be as
simple as generating the interrupt number described in data, without checking
if the device was allowed to send that interrupt. If there is another
component performing the isolation, one might consider translating the
doorbell address superfluous.

With virtio-iommu, the device can advertise the doorbell address as
untranslated by using the PROBE request with a reserved region (see
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}).

For example, if the virtual platform has an IRQ remapping module with a
doorbell in the physical address range 0xfee00000-0xfeefffff, then the
device can present the following property to the driver:

\begin{lstlisting}
struct __attribute__((packed)) {
	struct virtio_iommu_probe_property	head;
	struct virtio_iommu_probe_resv_mem	mem;
} doorbell = {
	.head = {
		.type		= VIRTIO_IOMMU_PROBE_T_RESV_MEM,
		.length		= sizeof(doorbell.mem),
	},
	.mem = {
		.subtype	= VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS,
		.flags		= VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI,
		.addr		= 0xfee00000,
		.size		= 0x00100000,
	},
};
\end{lstlisting}

\subsubsection{Address translation}\label{sec:viommu / MSI / Address translation}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-remap.png}
  \caption{MSI remapping with address translation}
\end{figure}

On some systems (e.g. ARM-based platforms) the IOMMU does not have a special
MSI window, and MSIs are treated like any other memory write. The MSI address
therefore has to be translated by the IOMMU before reaching the IRQ chip.

Address translation may be used as a rudimentary form of MSI isolation,
but multiple endpoints will typically access the same doorbell. Address
translation can only forbid an endpoint from sending interrupts. If it is
allowed to send MSIs, the endpoint can easily spoof another endpoint by
sending interrupts that were not assigned to it.

From the virtio-iommu point of view, this is the simplest to implement, because
there is no special address range. The whole address space is treated the same
by the virtio-iommu device.

However, this mode of operations may add significant complexity in the host
implementation.


\subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping}

Some IOMMUs (e.g. Intel and AMD IOMMUs) are able to remap IRQs themselves. 

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-irq-remap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

This version of virtio-iommu doesn't support IRQ remapping.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (4 preceding siblings ...)
  (?)
@ 2017-08-14  8:27 ` Tian, Kevin
       [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2017-08-14  9:38   ` Jean-Philippe Brucker
  -1 siblings, 2 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-08-14  8:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Saturday, August 5, 2017 2:19 AM
> 
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description
> based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

emulated IOMMU has similar requirement, e.g. available PASID bits,
address widths, etc. which may break guest usage if not aligned to
physical limitation. Suppose we can introduce a general interface
through VFIO for all vIOMMU incarnations. 

> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.

what's your planned cadence for future versions? :-)

> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (5 preceding siblings ...)
  (?)
@ 2017-08-14  8:27 ` Tian, Kevin
  -1 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-08-14  8:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Saturday, August 5, 2017 2:19 AM
> 
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description
> based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

emulated IOMMU has similar requirement, e.g. available PASID bits,
address widths, etc. which may break guest usage if not aligned to
physical limitation. Suppose we can introduce a general interface
through VFIO for all vIOMMU incarnations. 

> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.

what's your planned cadence for future versions? :-)

> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-14  8:27 ` [RFC] virtio-iommu version 0.4 Tian, Kevin
@ 2017-08-14  9:38       ` Jean-Philippe Brucker
  2017-08-14  9:38   ` Jean-Philippe Brucker
  1 sibling, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-14  9:38 UTC (permalink / raw)
  To: Tian, Kevin, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, marc.zyngier-5wv7dgnIgG8,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

On 14/08/17 09:27, Tian, Kevin wrote:
>> * First, since the IOMMU is paravirtualized, the device can expose some
>>   properties of the physical topology to the guest, and let it allocate
>>   resources more efficiently. For example, when the virtio-iommu manages
>>   both physical and emulated endpoints, with different underlying IOMMUs,
>>   we now have a way to describe multiple page and block granularities,
>>   instead of forcing the guest to use the most restricted one for all
>>   endpoints. This will most likely be in v0.5.
> 
> emulated IOMMU has similar requirement, e.g. available PASID bits,
> address widths, etc. which may break guest usage if not aligned to
> physical limitation. Suppose we can introduce a general interface
> through VFIO for all vIOMMU incarnations. 

A nice location for this kind of info would be sysfs, as discussed in the
SVM virtualization thread [1]. Properties of an IOMMU could be described
in /sys/class/iommu/<dev>. Properties of a PCI device are available in its
PASID/PRI capabilities. For platform devices we'll have to look at DT and
ACPI properties in /sys/firmware.

>> * Then on top of that, a major improvement will describe hardware
>>   acceleration features available to the guest. There is what I call "Page
>>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>>   for the guest to manipulate its own page tables instead of sending
>>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>>   reporting, will also permit SVM virtualization on different platforms.
> 
> what's your planned cadence for future versions? :-)

Hard to say, it depends on a number of things. I have various other tasks
eating up my bandwidth at the moment and I may have to considerably rework
this version depending on the feedback it gets. Ideally, I would like to
get the base driver merged and a proposal for hardware acceleration out by
the end of the year, but I obviously can't make any guarantee.

Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-14  8:27 ` [RFC] virtio-iommu version 0.4 Tian, Kevin
       [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-08-14  9:38   ` Jean-Philippe Brucker
  1 sibling, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-14  9:38 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

On 14/08/17 09:27, Tian, Kevin wrote:
>> * First, since the IOMMU is paravirtualized, the device can expose some
>>   properties of the physical topology to the guest, and let it allocate
>>   resources more efficiently. For example, when the virtio-iommu manages
>>   both physical and emulated endpoints, with different underlying IOMMUs,
>>   we now have a way to describe multiple page and block granularities,
>>   instead of forcing the guest to use the most restricted one for all
>>   endpoints. This will most likely be in v0.5.
> 
> emulated IOMMU has similar requirement, e.g. available PASID bits,
> address widths, etc. which may break guest usage if not aligned to
> physical limitation. Suppose we can introduce a general interface
> through VFIO for all vIOMMU incarnations. 

A nice location for this kind of info would be sysfs, as discussed in the
SVM virtualization thread [1]. Properties of an IOMMU could be described
in /sys/class/iommu/<dev>. Properties of a PCI device are available in its
PASID/PRI capabilities. For platform devices we'll have to look at DT and
ACPI properties in /sys/firmware.

>> * Then on top of that, a major improvement will describe hardware
>>   acceleration features available to the guest. There is what I call "Page
>>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>>   for the guest to manipulate its own page tables instead of sending
>>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>>   reporting, will also permit SVM virtualization on different platforms.
> 
> what's your planned cadence for future versions? :-)

Hard to say, it depends on a number of things. I have various other tasks
eating up my bandwidth at the moment and I may have to considerably rework
this version depending on the feedback it gets. Ideally, I would like to
get the base driver merged and a proposal for hardware acceleration out by
the end of the year, but I obviously can't make any guarantee.

Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-08-14  9:38       ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-14  9:38 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

On 14/08/17 09:27, Tian, Kevin wrote:
>> * First, since the IOMMU is paravirtualized, the device can expose some
>>   properties of the physical topology to the guest, and let it allocate
>>   resources more efficiently. For example, when the virtio-iommu manages
>>   both physical and emulated endpoints, with different underlying IOMMUs,
>>   we now have a way to describe multiple page and block granularities,
>>   instead of forcing the guest to use the most restricted one for all
>>   endpoints. This will most likely be in v0.5.
> 
> emulated IOMMU has similar requirement, e.g. available PASID bits,
> address widths, etc. which may break guest usage if not aligned to
> physical limitation. Suppose we can introduce a general interface
> through VFIO for all vIOMMU incarnations. 

A nice location for this kind of info would be sysfs, as discussed in the
SVM virtualization thread [1]. Properties of an IOMMU could be described
in /sys/class/iommu/<dev>. Properties of a PCI device are available in its
PASID/PRI capabilities. For platform devices we'll have to look at DT and
ACPI properties in /sys/firmware.

>> * Then on top of that, a major improvement will describe hardware
>>   acceleration features available to the guest. There is what I call "Page
>>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>>   for the guest to manipulate its own page tables instead of sending
>>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>>   reporting, will also permit SVM virtualization on different platforms.
> 
> what's your planned cadence for future versions? :-)

Hard to say, it depends on a number of things. I have various other tasks
eating up my bandwidth at the moment and I may have to considerably rework
this version depending on the feedback it gets. Ideally, I would like to
get the base driver merged and a proposal for hardware acceleration out by
the end of the year, but I obviously can't make any guarantee.

Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
  (?)
@ 2017-08-16  4:08   ` Adam Tao
  -1 siblings, 0 replies; 71+ messages in thread
From: Adam Tao @ 2017-08-16  4:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, kvm, virtualization, virtio-dev, will.deacon,
	robin.murphy, lorenzo.pieralisi, mst, jasowang, marc.zyngier,
	eric.auger, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

On Fri, Aug 04, 2017 at 07:19:25PM +0100, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
Hi, Brucker
I read the related spec for virtio IOMMU,
I am wondering if we support both the virtual and physical devices in
the guest to use the virtio IOMMU, how can we config the related address
translation for the different type of devices.
e.g.
1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
by the vhost device.
2. for physical devices we can directly config the IOMMU/SMMU on the
host.

So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
thanks
Adam
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-08-16  4:08   ` Adam Tao
  0 siblings, 0 replies; 71+ messages in thread
From: Adam Tao @ 2017-08-16  4:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, kvm, virtualization, virtio-dev, will.deacon,
	robin.murphy, lorenzo.pieralisi, mst, jasowang, marc.zyngier,
	eric.auger, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

On Fri, Aug 04, 2017 at 07:19:25PM +0100, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
Hi, Brucker
I read the related spec for virtio IOMMU,
I am wondering if we support both the virtual and physical devices in
the guest to use the virtio IOMMU, how can we config the related address
translation for the different type of devices.
e.g.
1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
by the vhost device.
2. for physical devices we can directly config the IOMMU/SMMU on the
host.

So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
thanks
Adam
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (7 preceding siblings ...)
  (?)
@ 2017-08-16  4:08 ` Adam Tao
  -1 siblings, 0 replies; 71+ messages in thread
From: Adam Tao @ 2017-08-16  4:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: virtio-dev, lorenzo.pieralisi, kvm, mst, marc.zyngier,
	will.deacon, virtualization, eric.auger, iommu, robin.murphy,
	eric.auger.pro

On Fri, Aug 04, 2017 at 07:19:25PM +0100, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
Hi, Brucker
I read the related spec for virtio IOMMU,
I am wondering if we support both the virtual and physical devices in
the guest to use the virtio IOMMU, how can we config the related address
translation for the different type of devices.
e.g.
1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
by the vhost device.
2. for physical devices we can directly config the IOMMU/SMMU on the
host.

So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
thanks
Adam
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-08-16  4:08   ` Adam Tao
  0 siblings, 0 replies; 71+ messages in thread
From: Adam Tao @ 2017-08-16  4:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, kvm, virtualization, virtio-dev, will.deacon,
	robin.murphy, lorenzo.pieralisi, mst, jasowang, marc.zyngier,
	eric.auger, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

On Fri, Aug 04, 2017 at 07:19:25PM +0100, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
Hi, Brucker
I read the related spec for virtio IOMMU,
I am wondering if we support both the virtual and physical devices in
the guest to use the virtio IOMMU, how can we config the related address
translation for the different type of devices.
e.g.
1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
by the vhost device.
2. for physical devices we can directly config the IOMMU/SMMU on the
host.

So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
thanks
Adam
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-16  4:08   ` Adam Tao
@ 2017-08-17 10:12     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-17 10:12 UTC (permalink / raw)
  To: Adam Tao
  Cc: iommu, kvm, virtualization, virtio-dev, will.deacon,
	robin.murphy, lorenzo.pieralisi, mst, jasowang, marc.zyngier,
	eric.auger, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Adam,

On 16/08/17 05:08, Adam Tao wrote:
>> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>>   Bhushan.
> Hi, Brucker
> I read the related spec for virtio IOMMU,
> I am wondering if we support both the virtual and physical devices in
> the guest to use the virtio IOMMU, how can we config the related address
> translation for the different type of devices.
> e.g.
> 1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
> by the vhost device.
> 2. for physical devices we can directly config the IOMMU/SMMU on the
> host.
> 
> So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
> thanks
> Adam

Eric's series [3] adds support for virtual devices (I'v only had time to
test virtio-net so far, but as vhost-iotlb seems to be supported by Qemu,
it should work with vhost-net as well). Bharat is working on the VFIO
backend for physical devices. As far as I'm aware, his latest version is:

[7] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04094.html

Thanks,
Jean


>> [1] http://www.spinics.net/lists/kvm/msg147990.html
>> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>>     I reiterate the disclaimers: don't use this document as a reference,
>>     it's a draft. It's also not an OASIS document yet. It may be riddled
>>     with mistakes. As this is a working draft, it is unstable and I do not
>>     guarantee backward compatibility of future versions.
>> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
>> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>>     Warning: UAPI headers have changed! They didn't follow the spec,
>>     please update. (Use branch v0.1, that has the old headers, for the
>>     Qemu prototype [3])
>> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>>     --viommu virtio[,opts] to instantiate a device.
>> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-16  4:08   ` Adam Tao
                     ` (2 preceding siblings ...)
  (?)
@ 2017-08-17 10:12   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-17 10:12 UTC (permalink / raw)
  To: Adam Tao
  Cc: virtio-dev, lorenzo.pieralisi, kvm, mst, marc.zyngier,
	will.deacon, virtualization, eric.auger, iommu, robin.murphy,
	eric.auger.pro

Hi Adam,

On 16/08/17 05:08, Adam Tao wrote:
>> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>>   Bhushan.
> Hi, Brucker
> I read the related spec for virtio IOMMU,
> I am wondering if we support both the virtual and physical devices in
> the guest to use the virtio IOMMU, how can we config the related address
> translation for the different type of devices.
> e.g.
> 1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
> by the vhost device.
> 2. for physical devices we can directly config the IOMMU/SMMU on the
> host.
> 
> So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
> thanks
> Adam

Eric's series [3] adds support for virtual devices (I'v only had time to
test virtio-net so far, but as vhost-iotlb seems to be supported by Qemu,
it should work with vhost-net as well). Bharat is working on the VFIO
backend for physical devices. As far as I'm aware, his latest version is:

[7] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04094.html

Thanks,
Jean


>> [1] http://www.spinics.net/lists/kvm/msg147990.html
>> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>>     I reiterate the disclaimers: don't use this document as a reference,
>>     it's a draft. It's also not an OASIS document yet. It may be riddled
>>     with mistakes. As this is a working draft, it is unstable and I do not
>>     guarantee backward compatibility of future versions.
>> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
>> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>>     Warning: UAPI headers have changed! They didn't follow the spec,
>>     please update. (Use branch v0.1, that has the old headers, for the
>>     Qemu prototype [3])
>> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>>     --viommu virtio[,opts] to instantiate a device.
>> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-08-17 10:12     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-17 10:12 UTC (permalink / raw)
  To: Adam Tao
  Cc: iommu, kvm, virtualization, virtio-dev, will.deacon,
	robin.murphy, lorenzo.pieralisi, mst, jasowang, marc.zyngier,
	eric.auger, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Adam,

On 16/08/17 05:08, Adam Tao wrote:
>> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>>   Bhushan.
> Hi, Brucker
> I read the related spec for virtio IOMMU,
> I am wondering if we support both the virtual and physical devices in
> the guest to use the virtio IOMMU, how can we config the related address
> translation for the different type of devices.
> e.g.
> 1. for Virtio devs(e.g. virtio-net), we can change the memmap table used
> by the vhost device.
> 2. for physical devices we can directly config the IOMMU/SMMU on the
> host.
> 
> So do you know are there any RFCs for the back-end implementation for the virtio-IOMMU
> thanks
> Adam

Eric's series [3] adds support for virtual devices (I'v only had time to
test virtio-net so far, but as vhost-iotlb seems to be supported by Qemu,
it should work with vhost-net as well). Bharat is working on the VFIO
backend for physical devices. As far as I'm aware, his latest version is:

[7] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04094.html

Thanks,
Jean


>> [1] http://www.spinics.net/lists/kvm/msg147990.html
>> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>>     I reiterate the disclaimers: don't use this document as a reference,
>>     it's a draft. It's also not an OASIS document yet. It may be riddled
>>     with mistakes. As this is a working draft, it is unstable and I do not
>>     guarantee backward compatibility of future versions.
>> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
>> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>>     Warning: UAPI headers have changed! They didn't follow the spec,
>>     please update. (Use branch v0.1, that has the old headers, for the
>>     Qemu prototype [3])
>> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>>     --viommu virtio[,opts] to instantiate a device.
>> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-17 10:12     ` Jean-Philippe Brucker
@ 2017-08-17 11:27         ` Bharat Bhushan
  -1 siblings, 0 replies; 71+ messages in thread
From: Bharat Bhushan @ 2017-08-17 11:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Adam Tao
  Cc: virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b,
	kevin.tian-ral2JQCrhuEAvxtiuMwx3w, kvm-u79uwXL29TY76Z2rM5mHXA,
	mst-H+wXaHxf7aLQT0dZR+AlfA, marc.zyngier-5wv7dgnIgG8,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

Hi,

> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org]
> Sent: Thursday, August 17, 2017 3:42 PM
> To: Adam Tao <taozhe1-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b@public.gmane.org;
> will.deacon-5wv7dgnIgG8@public.gmane.org; robin.murphy-5wv7dgnIgG8@public.gmane.org; lorenzo.pieralisi-5wv7dgnIgG8@public.gmane.org;
> mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; marc.zyngier-5wv7dgnIgG8@public.gmane.org;
> eric.auger-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; Bharat Bhushan
> <bharat.bhushan-3arQi8VN3Tc@public.gmane.org>; peterx-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; kevin.tian-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
> Subject: Re: [virtio-dev] [RFC] virtio-iommu version 0.4
> 
> Hi Adam,
> 
> On 16/08/17 05:08, Adam Tao wrote:
> >> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >>   Bhushan.
> > Hi, Brucker
> > I read the related spec for virtio IOMMU, I am wondering if we support
> > both the virtual and physical devices in the guest to use the virtio
> > IOMMU, how can we config the related address translation for the
> > different type of devices.
> > e.g.
> > 1. for Virtio devs(e.g. virtio-net), we can change the memmap table
> > used by the vhost device.
> > 2. for physical devices we can directly config the IOMMU/SMMU on the
> > host.
> >
> > So do you know are there any RFCs for the back-end implementation for
> > the virtio-IOMMU thanks Adam
> 
> Eric's series [3] adds support for virtual devices (I'v only had time to test
> virtio-net so far, but as vhost-iotlb seems to be supported by Qemu, it should
> work with vhost-net as well). Bharat is working on the VFIO backend for
> physical devices. As far as I'm aware, his latest version is:
> 
> [7] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04094.html

I am working on newer version which will be based on Eric-v3.
Actually I wanted to include the fix for the issue reported by Eric, it fails if we assign more than one interface.
I reproduced the issues on my side and know the root cause. I am working on fixing this.

Thanks
-Bharat

> 
> Thanks,
> Jean
> 
> 
> >> [1] http://www.spinics.net/lists/kvm/msg147990.html
> >> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >>     I reiterate the disclaimers: don't use this document as a reference,
> >>     it's a draft. It's also not an OASIS document yet. It may be riddled
> >>     with mistakes. As this is a working draft, it is unstable and I do not
> >>     guarantee backward compatibility of future versions.
> >> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> >> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >>     Warning: UAPI headers have changed! They didn't follow the spec,
> >>     please update. (Use branch v0.1, that has the old headers, for the
> >>     Qemu prototype [3])
> >> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >>     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >>     --viommu virtio[,opts] to instantiate a device.
> >> [6]
> >> http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-17 10:12     ` Jean-Philippe Brucker
  (?)
  (?)
@ 2017-08-17 11:27     ` Bharat Bhushan
  -1 siblings, 0 replies; 71+ messages in thread
From: Bharat Bhushan @ 2017-08-17 11:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Adam Tao
  Cc: virtio-dev, lorenzo.pieralisi, kvm, mst, marc.zyngier,
	will.deacon, virtualization, eric.auger, iommu, robin.murphy,
	eric.auger.pro

Hi,

> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Thursday, August 17, 2017 3:42 PM
> To: Adam Tao <taozhe1@huawei.com>
> Cc: iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org;
> will.deacon@arm.com; robin.murphy@arm.com; lorenzo.pieralisi@arm.com;
> mst@redhat.com; jasowang@redhat.com; marc.zyngier@arm.com;
> eric.auger@redhat.com; eric.auger.pro@gmail.com; Bharat Bhushan
> <bharat.bhushan@nxp.com>; peterx@redhat.com; kevin.tian@intel.com
> Subject: Re: [virtio-dev] [RFC] virtio-iommu version 0.4
> 
> Hi Adam,
> 
> On 16/08/17 05:08, Adam Tao wrote:
> >> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >>   Bhushan.
> > Hi, Brucker
> > I read the related spec for virtio IOMMU, I am wondering if we support
> > both the virtual and physical devices in the guest to use the virtio
> > IOMMU, how can we config the related address translation for the
> > different type of devices.
> > e.g.
> > 1. for Virtio devs(e.g. virtio-net), we can change the memmap table
> > used by the vhost device.
> > 2. for physical devices we can directly config the IOMMU/SMMU on the
> > host.
> >
> > So do you know are there any RFCs for the back-end implementation for
> > the virtio-IOMMU thanks Adam
> 
> Eric's series [3] adds support for virtual devices (I'v only had time to test
> virtio-net so far, but as vhost-iotlb seems to be supported by Qemu, it should
> work with vhost-net as well). Bharat is working on the VFIO backend for
> physical devices. As far as I'm aware, his latest version is:
> 
> [7] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04094.html

I am working on newer version which will be based on Eric-v3.
Actually I wanted to include the fix for the issue reported by Eric, it fails if we assign more than one interface.
I reproduced the issues on my side and know the root cause. I am working on fixing this.

Thanks
-Bharat

> 
> Thanks,
> Jean
> 
> 
> >> [1] http://www.spinics.net/lists/kvm/msg147990.html
> >> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >>     I reiterate the disclaimers: don't use this document as a reference,
> >>     it's a draft. It's also not an OASIS document yet. It may be riddled
> >>     with mistakes. As this is a working draft, it is unstable and I do not
> >>     guarantee backward compatibility of future versions.
> >> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> >> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >>     Warning: UAPI headers have changed! They didn't follow the spec,
> >>     please update. (Use branch v0.1, that has the old headers, for the
> >>     Qemu prototype [3])
> >> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >>     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >>     --viommu virtio[,opts] to instantiate a device.
> >> [6]
> >> http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-08-17 11:27         ` Bharat Bhushan
  0 siblings, 0 replies; 71+ messages in thread
From: Bharat Bhushan @ 2017-08-17 11:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Adam Tao
  Cc: iommu, kvm, virtualization, virtio-dev, will.deacon,
	robin.murphy, lorenzo.pieralisi, mst, jasowang, marc.zyngier,
	eric.auger, eric.auger.pro, peterx, kevin.tian

Hi,

> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Thursday, August 17, 2017 3:42 PM
> To: Adam Tao <taozhe1@huawei.com>
> Cc: iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org;
> will.deacon@arm.com; robin.murphy@arm.com; lorenzo.pieralisi@arm.com;
> mst@redhat.com; jasowang@redhat.com; marc.zyngier@arm.com;
> eric.auger@redhat.com; eric.auger.pro@gmail.com; Bharat Bhushan
> <bharat.bhushan@nxp.com>; peterx@redhat.com; kevin.tian@intel.com
> Subject: Re: [virtio-dev] [RFC] virtio-iommu version 0.4
> 
> Hi Adam,
> 
> On 16/08/17 05:08, Adam Tao wrote:
> >> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >>   Bhushan.
> > Hi, Brucker
> > I read the related spec for virtio IOMMU, I am wondering if we support
> > both the virtual and physical devices in the guest to use the virtio
> > IOMMU, how can we config the related address translation for the
> > different type of devices.
> > e.g.
> > 1. for Virtio devs(e.g. virtio-net), we can change the memmap table
> > used by the vhost device.
> > 2. for physical devices we can directly config the IOMMU/SMMU on the
> > host.
> >
> > So do you know are there any RFCs for the back-end implementation for
> > the virtio-IOMMU thanks Adam
> 
> Eric's series [3] adds support for virtual devices (I'v only had time to test
> virtio-net so far, but as vhost-iotlb seems to be supported by Qemu, it should
> work with vhost-net as well). Bharat is working on the VFIO backend for
> physical devices. As far as I'm aware, his latest version is:
> 
> [7] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04094.html

I am working on newer version which will be based on Eric-v3.
Actually I wanted to include the fix for the issue reported by Eric, it fails if we assign more than one interface.
I reproduced the issues on my side and know the root cause. I am working on fixing this.

Thanks
-Bharat

> 
> Thanks,
> Jean
> 
> 
> >> [1] http://www.spinics.net/lists/kvm/msg147990.html
> >> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >>     I reiterate the disclaimers: don't use this document as a reference,
> >>     it's a draft. It's also not an OASIS document yet. It may be riddled
> >>     with mistakes. As this is a working draft, it is unstable and I do not
> >>     guarantee backward compatibility of future versions.
> >> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> >> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >>     Warning: UAPI headers have changed! They didn't follow the spec,
> >>     please update. (Use branch v0.1, that has the old headers, for the
> >>     Qemu prototype [3])
> >> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >>     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >>     --viommu virtio[,opts] to instantiate a device.
> >> [6]
> >> http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-23 10:01   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-23 10:01 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

In order to extend requests with PASIDs and (later) nested mode, I intend
to rename "address_space" field to "domain", since it is a lot more
precise about what the field is referring to and the current name would
make these extensions confusing. Please find the rationale at [1].
"ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
be "VIRTIO_IOMMU_F_DOMAIN_BITS".

For those that had time to read this version, do you have other comments
and suggestions about v0.4? Otherwise it is the only update I have for
v0.5 (along with fine-grained address range and page size properties from
the quoted text) and I will send it soon.

In particular, please tell me now if you see the need for other
destructive changes like this one. They will be impossible to introduce
once a driver or device is upstream.

Thanks,
Jean

[1] https://www.spinics.net/lists/kvm/msg154573.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (8 preceding siblings ...)
  (?)
@ 2017-08-23 10:01 ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-23 10:01 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

In order to extend requests with PASIDs and (later) nested mode, I intend
to rename "address_space" field to "domain", since it is a lot more
precise about what the field is referring to and the current name would
make these extensions confusing. Please find the rationale at [1].
"ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
be "VIRTIO_IOMMU_F_DOMAIN_BITS".

For those that had time to read this version, do you have other comments
and suggestions about v0.4? Otherwise it is the only update I have for
v0.5 (along with fine-grained address range and page size properties from
the quoted text) and I will send it soon.

In particular, please tell me now if you see the need for other
destructive changes like this one. They will be impossible to introduce
once a driver or device is upstream.

Thanks,
Jean

[1] https://www.spinics.net/lists/kvm/msg154573.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-08-23 10:01   ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-23 10:01 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

In order to extend requests with PASIDs and (later) nested mode, I intend
to rename "address_space" field to "domain", since it is a lot more
precise about what the field is referring to and the current name would
make these extensions confusing. Please find the rationale at [1].
"ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
be "VIRTIO_IOMMU_F_DOMAIN_BITS".

For those that had time to read this version, do you have other comments
and suggestions about v0.4? Otherwise it is the only update I have for
v0.5 (along with fine-grained address range and page size properties from
the quoted text) and I will send it soon.

In particular, please tell me now if you see the need for other
destructive changes like this one. They will be impossible to introduce
once a driver or device is upstream.

Thanks,
Jean

[1] https://www.spinics.net/lists/kvm/msg154573.html

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-23 13:55   ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-08-23 13:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean-Philippe,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.

Please find some comments/questions below:

2.6.7:1
I do not understand the footnode #6 sentence: 'Without a specific
definition of the ACK requirements for a given property type, it simply
means that the driver read all fields of that property."
2.6.7:2
what if the driver has not provided a big enough buffer and the device
cannot report all supported properties?
2.6.8.2:
Couldn't we use the same terminology as iommu_resv_type in iommu.h?
the negative form is not the easiest to understand for me.
Why F_MSI?
2.6.8.2.1
Multiple overlapping RESV_MEM properties are merged together. Device
requirement? if same types I assume?

what if the device is a physical assigned device. does the device report
the host doorbell or the guest doorbell, may be worth to clarify.

3.1.1 last paragraph
you talk about device ID but this is a terminology only known by ARM I
think. Actually it is introduced later in 3.1.2.

3.1.2
About ACPI integration. Isn't IORT ARM specific? Will other
architectures use that table? In IORT we also have Named Component node
which look pretty the same. Could you elaborate of the difference?
Device Object name: isn't it the path to the LNR0005 in the namespace?

Thanks

Eric
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (11 preceding siblings ...)
  (?)
@ 2017-08-23 13:55 ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-08-23 13:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi Jean-Philippe,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.

Please find some comments/questions below:

2.6.7:1
I do not understand the footnode #6 sentence: 'Without a specific
definition of the ACK requirements for a given property type, it simply
means that the driver read all fields of that property."
2.6.7:2
what if the driver has not provided a big enough buffer and the device
cannot report all supported properties?
2.6.8.2:
Couldn't we use the same terminology as iommu_resv_type in iommu.h?
the negative form is not the easiest to understand for me.
Why F_MSI?
2.6.8.2.1
Multiple overlapping RESV_MEM properties are merged together. Device
requirement? if same types I assume?

what if the device is a physical assigned device. does the device report
the host doorbell or the guest doorbell, may be worth to clarify.

3.1.1 last paragraph
you talk about device ID but this is a terminology only known by ARM I
think. Actually it is introduced later in 3.1.2.

3.1.2
About ACPI integration. Isn't IORT ARM specific? Will other
architectures use that table? In IORT we also have Named Component node
which look pretty the same. Could you elaborate of the difference?
Device Object name: isn't it the path to the LNR0005 in the namespace?

Thanks

Eric
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-08-23 13:55   ` Auger Eric
  0 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-08-23 13:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean-Philippe,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.

Please find some comments/questions below:

2.6.7:1
I do not understand the footnode #6 sentence: 'Without a specific
definition of the ACK requirements for a given property type, it simply
means that the driver read all fields of that property."
2.6.7:2
what if the driver has not provided a big enough buffer and the device
cannot report all supported properties?
2.6.8.2:
Couldn't we use the same terminology as iommu_resv_type in iommu.h?
the negative form is not the easiest to understand for me.
Why F_MSI?
2.6.8.2.1
Multiple overlapping RESV_MEM properties are merged together. Device
requirement? if same types I assume?

what if the device is a physical assigned device. does the device report
the host doorbell or the guest doorbell, may be worth to clarify.

3.1.1 last paragraph
you talk about device ID but this is a terminology only known by ARM I
think. Actually it is introduced later in 3.1.2.

3.1.2
About ACPI integration. Isn't IORT ARM specific? Will other
architectures use that table? In IORT we also have Named Component node
which look pretty the same. Could you elaborate of the difference?
Device Object name: isn't it the path to the LNR0005 in the namespace?

Thanks

Eric
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-08-23 10:01   ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-28  7:39     ` Tian, Kevin
  -1 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-08-28  7:39 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, August 23, 2017 6:01 PM
> 
> On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> > Other extensions are in preparation. I won't detail them here because
> v0.4
> > already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu
> manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> 
> In order to extend requests with PASIDs and (later) nested mode, I intend
> to rename "address_space" field to "domain", since it is a lot more
> precise about what the field is referring to and the current name would
> make these extensions confusing. Please find the rationale at [1].
> "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS"
> will
> be "VIRTIO_IOMMU_F_DOMAIN_BITS".
> 
> For those that had time to read this version, do you have other comments
> and suggestions about v0.4? Otherwise it is the only update I have for
> v0.5 (along with fine-grained address range and page size properties from
> the quoted text) and I will send it soon.
> 
> In particular, please tell me now if you see the need for other
> destructive changes like this one. They will be impossible to introduce
> once a driver or device is upstream.
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg154573.html

Here comes some comments:

1.1 Motivation

You describe I/O page faults handling as future work. Seems you considered
only recoverable fault (since "aka. PCI PRI" being used). What about other
unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
a valid mapping? Even when there is no PRI support, we need some basic
form of fault reporting mechanism to indicate such errors to guest.

2.6.8.2 Property RESV_MEM

I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
should be explicitly reported. Is there any real example on bare metal IOMMU?
usually reserved memory is reported to CPU through other method (e.g. e820
on x86 platform). Of course MSI is a special case which is covered by BYPASS 
and MSI flag... If yes, maybe you can also include an example in implementation 
notes.

Another thing I want to ask your opinion, about whether there is value of
adding another subtype (MEM_T_IDENTITY), asking for identity mapping
in the address space. It's similar to Reserved Memory Region Reporting
(RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
memory ranges which may be DMA target and has to be identity mapped
when DMA remapping is enabled. I'm not sure whether ARM has similar
capability and whether there might be a general usage beyond VT-d. For
now the only usage in my mind is to assign a device with RMRR associated
on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
propagated to the guest (since identity mapping also means reservation
of virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave 
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
regions pass through untranslated. In other words, the device SHOULD 
handle such a region as if it was identity-mapped (virtual address equal to
physical address). If the endpoint is not attached to any address space, 
then the device MAY abort the transaction.
--citation end

I have a question for the last sentence. From definition of BYPASS, it's
orthogonal to whether there is an address space attached, then should
we still allow "May abort" behavior? 

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] RE: [RFC] virtio-iommu version 0.4
@ 2017-08-28  7:39     ` Tian, Kevin
  0 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-08-28  7:39 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, August 23, 2017 6:01 PM
> 
> On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> > Other extensions are in preparation. I won't detail them here because
> v0.4
> > already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu
> manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> 
> In order to extend requests with PASIDs and (later) nested mode, I intend
> to rename "address_space" field to "domain", since it is a lot more
> precise about what the field is referring to and the current name would
> make these extensions confusing. Please find the rationale at [1].
> "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS"
> will
> be "VIRTIO_IOMMU_F_DOMAIN_BITS".
> 
> For those that had time to read this version, do you have other comments
> and suggestions about v0.4? Otherwise it is the only update I have for
> v0.5 (along with fine-grained address range and page size properties from
> the quoted text) and I will send it soon.
> 
> In particular, please tell me now if you see the need for other
> destructive changes like this one. They will be impossible to introduce
> once a driver or device is upstream.
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg154573.html

Here comes some comments:

1.1 Motivation

You describe I/O page faults handling as future work. Seems you considered
only recoverable fault (since "aka. PCI PRI" being used). What about other
unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
a valid mapping? Even when there is no PRI support, we need some basic
form of fault reporting mechanism to indicate such errors to guest.

2.6.8.2 Property RESV_MEM

I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
should be explicitly reported. Is there any real example on bare metal IOMMU?
usually reserved memory is reported to CPU through other method (e.g. e820
on x86 platform). Of course MSI is a special case which is covered by BYPASS 
and MSI flag... If yes, maybe you can also include an example in implementation 
notes.

Another thing I want to ask your opinion, about whether there is value of
adding another subtype (MEM_T_IDENTITY), asking for identity mapping
in the address space. It's similar to Reserved Memory Region Reporting
(RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
memory ranges which may be DMA target and has to be identity mapped
when DMA remapping is enabled. I'm not sure whether ARM has similar
capability and whether there might be a general usage beyond VT-d. For
now the only usage in my mind is to assign a device with RMRR associated
on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
propagated to the guest (since identity mapping also means reservation
of virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave 
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
regions pass through untranslated. In other words, the device SHOULD 
handle such a region as if it was identity-mapped (virtual address equal to
physical address). If the endpoint is not attached to any address space, 
then the device MAY abort the transaction.
--citation end

I have a question for the last sentence. From definition of BYPASS, it's
orthogonal to whether there is an address space attached, then should
we still allow "May abort" behavior? 

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-23 13:55   ` [virtio-dev] " Auger Eric
@ 2017-09-06 11:48       ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:48 UTC (permalink / raw)
  To: Auger Eric, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, marc.zyngier-5wv7dgnIgG8,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

Hi Eric,

On 23/08/17 14:55, Auger Eric wrote:
> Please find some comments/questions below:

Thanks a lot for this. Sorry for the delay, I was on holiday and it took
me a while to sort out the details.

> 2.6.7:1
> I do not understand the footnode #6 sentence: 'Without a specific
> definition of the ACK requirements for a given property type, it simply
> means that the driver read all fields of that property."

That doesn't serve any purpose, I'll remove the footnote.

> 2.6.7:2
> what if the driver has not provided a big enough buffer and the device
> cannot report all supported properties?

Good point, how about adding the following to 2.6.7.2 - Device
Requirements: PROBE Request

    If the properties array is smaller than \field{probe_size}, then the
device SHOULD NOT write any property and SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_INVAL.

> 2.6.8.2:
> Couldn't we use the same terminology as iommu_resv_type in iommu.h?
> the negative form is not the easiest to understand for me.
> Why F_MSI?

It also needs to be understandable by someone not familiar with the Linux
APIs. It's not entirely obvious to me what the iommu.h regions are without
looking at the implementation, maybe using names from the device point of
view is more clear. I don't mind simplifying though, as I don't see a
reason for the driver to know about BYPASS regions other than
HW MSIs.

How about I remove ABORT and BYPASS, make the only subtype RESV_MEM_T_MSI
(1)? We'd use subtype 0 as "don't care, just don't map this". Can call it
RESV_MEM_T_RESERVED. Flags won't be used for these subtypes.

Another subtype will be RESV_MEM_T_IDENTITY (see my reply to Kevin).
IDENTITY regions must be identity-mapped by the guest and, as I understand
it, are virtual addresses used for DMA by firmware. The flags field will
then contain read/write protection flags.

> 2.6.8.2.1
> Multiple overlapping RESV_MEM properties are merged together. Device
> requirement? if same types I assume?

Combination rules apply to device and driver. When the driver puts
multiple endpoints in the same domain, combination rules apply. They will
become important once the guest attempts to do things like combining
endpoints with different PASID capabilities in the same domain. I'm
considering these endpoints to be behind different physical IOMMUs.

For reserved regions, we specify what the driver should do and what the
device should enforce with regard to mapping IOVAs of overlapping regions.
For example, if a device has some virtual address RESV_T_MSI and an other
device has the same virtual address RESV_T_IDENTITY, what should the
driver do?

I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
have a meaning within a domain. From DMA mappings point of view, it is
effectively the same as RESV_T_RESERVED. When merging RESV_T_RESERVED and
RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is required
for one endpoint to work (the one with IDENTITY) and is not accessed by
the other (the driver must not allocate this IOVA range for DMA).

More generally, I'd like to add the following combination table to the
spec, that shows the resulting reserved region type after merging regions
from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
bottom half.

      | N | R | M | I
    ------------------
    N | N | R | M | ?
    ------------------
    R |   | R | R | I
    ------------------
    M |   |   | M | I
    ------------------
    I |   |   |   | I

The 'I' column is problematic. If one endpoint has an IDENTITY region and
the other doesn't have anything at that virtual address, then making that
region an identity mapping will result in the second endpoint being able
to access firmware memory. If using nested translation, stage-2 can help
us here. But in general we should allow the device to reject an attach
that would result in a N+I combination if the host considers it too dodgy.
I think the other combinations are safe enough.

> what if the device is a physical assigned device. does the device report
> the host doorbell or the guest doorbell, may be worth to clarify.

Indeed. For a physical assigned device it is the guest doorbell as it
corresponds to the virtual ITS or APIC (though with the APIC it's the same
region).

> 3.1.1 last paragraph
> you talk about device ID but this is a terminology only known by ARM I
> think. Actually it is introduced later in 3.1.2.

Indeed, I'll move the explanation up.

> 3.1.2
> About ACPI integration. Isn't IORT ARM specific?

IORT is currently edited by ARM, as is virtio-iommu. Their contents are
not ARM specific however, they describe software interfaces that solve
mostly generic problems, and aim to be open.

IORT describes the relation between endpoints and their translation
agents. This problem has already been solved by IVRS, DMAR and IORT. From
a technical point of view the latter is more flexible and can host a new
translation agent. It should be relatively easy to use for virtio-iommu in
the kernel, and therefore I don't think we should introduce a fourth type
of table just for virtio-iommu.

> Will other architectures use that table?

Yes, in particular the IORT node would be used on virtual x86 platforms.
On ARM platforms, Linux guests use existing device-tree bindings.
Operating systems that don't implement DT could use IORT as well.

> In IORT we also have Named Component node
> which look pretty the same. Could you elaborate of the difference?

It's the same mechanism, except that the named component node is supposed
to be a leaf in the topology, not a translation agent. Using the same node
type could make it difficult for the driver to understand the table. But
see below.

> Device Object name: isn't it the path to the LNR0005 in the namespace?

It is. However, I wrote the example table to give a rough idea, but didn't
investigate the feasibility. I now think it would be easier to not present
the virtio-iommu as a LNRO0005 to the guest, but solely as the IORT node.
Instead of a namespace path, the node would directly describe the _CRS
content, which should be enough for the kernel to instantiate the device.

I'll prototype something to understand the problem better and find which
is easier to do. I'm busy preparing for conferences, so it might be two or
three weeks.


Unrelated remarks:

* When renaming "devices" to "endpoints" in v0.3 (to avoid confusion wrt
"The device", which is virtio-iommu dev), I didn't update the
attach/detach/probe requests, they still have 'device' fields. Since I'm
already butchering the API by renaming address_space to domain, should I
also rename 'device' to 'endpoint' in the structure? It doesn't matter as
much, since the difference here is mostly stylistic. But I think I will
get it out of the way for v0.5, to make things more clear overall.

* Regarding my question in v0.4 about adding a 'size' field to all
requests. I don't think we should do that. The problem was for parsing
interleaved RO/WO/RO buffers, but we might never hit the problem if we
design virtio-iommu properly. Framing should be left to the transport
protocol and it'd be mostly redundant in virtio-iommu requests.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-23 13:55   ` [virtio-dev] " Auger Eric
  (?)
  (?)
@ 2017-09-06 11:48   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:48 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi Eric,

On 23/08/17 14:55, Auger Eric wrote:
> Please find some comments/questions below:

Thanks a lot for this. Sorry for the delay, I was on holiday and it took
me a while to sort out the details.

> 2.6.7:1
> I do not understand the footnode #6 sentence: 'Without a specific
> definition of the ACK requirements for a given property type, it simply
> means that the driver read all fields of that property."

That doesn't serve any purpose, I'll remove the footnote.

> 2.6.7:2
> what if the driver has not provided a big enough buffer and the device
> cannot report all supported properties?

Good point, how about adding the following to 2.6.7.2 - Device
Requirements: PROBE Request

    If the properties array is smaller than \field{probe_size}, then the
device SHOULD NOT write any property and SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_INVAL.

> 2.6.8.2:
> Couldn't we use the same terminology as iommu_resv_type in iommu.h?
> the negative form is not the easiest to understand for me.
> Why F_MSI?

It also needs to be understandable by someone not familiar with the Linux
APIs. It's not entirely obvious to me what the iommu.h regions are without
looking at the implementation, maybe using names from the device point of
view is more clear. I don't mind simplifying though, as I don't see a
reason for the driver to know about BYPASS regions other than
HW MSIs.

How about I remove ABORT and BYPASS, make the only subtype RESV_MEM_T_MSI
(1)? We'd use subtype 0 as "don't care, just don't map this". Can call it
RESV_MEM_T_RESERVED. Flags won't be used for these subtypes.

Another subtype will be RESV_MEM_T_IDENTITY (see my reply to Kevin).
IDENTITY regions must be identity-mapped by the guest and, as I understand
it, are virtual addresses used for DMA by firmware. The flags field will
then contain read/write protection flags.

> 2.6.8.2.1
> Multiple overlapping RESV_MEM properties are merged together. Device
> requirement? if same types I assume?

Combination rules apply to device and driver. When the driver puts
multiple endpoints in the same domain, combination rules apply. They will
become important once the guest attempts to do things like combining
endpoints with different PASID capabilities in the same domain. I'm
considering these endpoints to be behind different physical IOMMUs.

For reserved regions, we specify what the driver should do and what the
device should enforce with regard to mapping IOVAs of overlapping regions.
For example, if a device has some virtual address RESV_T_MSI and an other
device has the same virtual address RESV_T_IDENTITY, what should the
driver do?

I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
have a meaning within a domain. From DMA mappings point of view, it is
effectively the same as RESV_T_RESERVED. When merging RESV_T_RESERVED and
RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is required
for one endpoint to work (the one with IDENTITY) and is not accessed by
the other (the driver must not allocate this IOVA range for DMA).

More generally, I'd like to add the following combination table to the
spec, that shows the resulting reserved region type after merging regions
from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
bottom half.

      | N | R | M | I
    ------------------
    N | N | R | M | ?
    ------------------
    R |   | R | R | I
    ------------------
    M |   |   | M | I
    ------------------
    I |   |   |   | I

The 'I' column is problematic. If one endpoint has an IDENTITY region and
the other doesn't have anything at that virtual address, then making that
region an identity mapping will result in the second endpoint being able
to access firmware memory. If using nested translation, stage-2 can help
us here. But in general we should allow the device to reject an attach
that would result in a N+I combination if the host considers it too dodgy.
I think the other combinations are safe enough.

> what if the device is a physical assigned device. does the device report
> the host doorbell or the guest doorbell, may be worth to clarify.

Indeed. For a physical assigned device it is the guest doorbell as it
corresponds to the virtual ITS or APIC (though with the APIC it's the same
region).

> 3.1.1 last paragraph
> you talk about device ID but this is a terminology only known by ARM I
> think. Actually it is introduced later in 3.1.2.

Indeed, I'll move the explanation up.

> 3.1.2
> About ACPI integration. Isn't IORT ARM specific?

IORT is currently edited by ARM, as is virtio-iommu. Their contents are
not ARM specific however, they describe software interfaces that solve
mostly generic problems, and aim to be open.

IORT describes the relation between endpoints and their translation
agents. This problem has already been solved by IVRS, DMAR and IORT. From
a technical point of view the latter is more flexible and can host a new
translation agent. It should be relatively easy to use for virtio-iommu in
the kernel, and therefore I don't think we should introduce a fourth type
of table just for virtio-iommu.

> Will other architectures use that table?

Yes, in particular the IORT node would be used on virtual x86 platforms.
On ARM platforms, Linux guests use existing device-tree bindings.
Operating systems that don't implement DT could use IORT as well.

> In IORT we also have Named Component node
> which look pretty the same. Could you elaborate of the difference?

It's the same mechanism, except that the named component node is supposed
to be a leaf in the topology, not a translation agent. Using the same node
type could make it difficult for the driver to understand the table. But
see below.

> Device Object name: isn't it the path to the LNR0005 in the namespace?

It is. However, I wrote the example table to give a rough idea, but didn't
investigate the feasibility. I now think it would be easier to not present
the virtio-iommu as a LNRO0005 to the guest, but solely as the IORT node.
Instead of a namespace path, the node would directly describe the _CRS
content, which should be enough for the kernel to instantiate the device.

I'll prototype something to understand the problem better and find which
is easier to do. I'm busy preparing for conferences, so it might be two or
three weeks.


Unrelated remarks:

* When renaming "devices" to "endpoints" in v0.3 (to avoid confusion wrt
"The device", which is virtio-iommu dev), I didn't update the
attach/detach/probe requests, they still have 'device' fields. Since I'm
already butchering the API by renaming address_space to domain, should I
also rename 'device' to 'endpoint' in the structure? It doesn't matter as
much, since the difference here is mostly stylistic. But I think I will
get it out of the way for v0.5, to make things more clear overall.

* Regarding my question in v0.4 about adding a 'size' field to all
requests. I don't think we should do that. The problem was for parsing
interleaved RO/WO/RO buffers, but we might never hit the problem if we
design virtio-iommu properly. Framing should be left to the transport
protocol and it'd be mostly redundant in virtio-iommu requests.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-09-06 11:48       ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:48 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Eric,

On 23/08/17 14:55, Auger Eric wrote:
> Please find some comments/questions below:

Thanks a lot for this. Sorry for the delay, I was on holiday and it took
me a while to sort out the details.

> 2.6.7:1
> I do not understand the footnode #6 sentence: 'Without a specific
> definition of the ACK requirements for a given property type, it simply
> means that the driver read all fields of that property."

That doesn't serve any purpose, I'll remove the footnote.

> 2.6.7:2
> what if the driver has not provided a big enough buffer and the device
> cannot report all supported properties?

Good point, how about adding the following to 2.6.7.2 - Device
Requirements: PROBE Request

    If the properties array is smaller than \field{probe_size}, then the
device SHOULD NOT write any property and SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_INVAL.

> 2.6.8.2:
> Couldn't we use the same terminology as iommu_resv_type in iommu.h?
> the negative form is not the easiest to understand for me.
> Why F_MSI?

It also needs to be understandable by someone not familiar with the Linux
APIs. It's not entirely obvious to me what the iommu.h regions are without
looking at the implementation, maybe using names from the device point of
view is more clear. I don't mind simplifying though, as I don't see a
reason for the driver to know about BYPASS regions other than
HW MSIs.

How about I remove ABORT and BYPASS, make the only subtype RESV_MEM_T_MSI
(1)? We'd use subtype 0 as "don't care, just don't map this". Can call it
RESV_MEM_T_RESERVED. Flags won't be used for these subtypes.

Another subtype will be RESV_MEM_T_IDENTITY (see my reply to Kevin).
IDENTITY regions must be identity-mapped by the guest and, as I understand
it, are virtual addresses used for DMA by firmware. The flags field will
then contain read/write protection flags.

> 2.6.8.2.1
> Multiple overlapping RESV_MEM properties are merged together. Device
> requirement? if same types I assume?

Combination rules apply to device and driver. When the driver puts
multiple endpoints in the same domain, combination rules apply. They will
become important once the guest attempts to do things like combining
endpoints with different PASID capabilities in the same domain. I'm
considering these endpoints to be behind different physical IOMMUs.

For reserved regions, we specify what the driver should do and what the
device should enforce with regard to mapping IOVAs of overlapping regions.
For example, if a device has some virtual address RESV_T_MSI and an other
device has the same virtual address RESV_T_IDENTITY, what should the
driver do?

I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
have a meaning within a domain. From DMA mappings point of view, it is
effectively the same as RESV_T_RESERVED. When merging RESV_T_RESERVED and
RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is required
for one endpoint to work (the one with IDENTITY) and is not accessed by
the other (the driver must not allocate this IOVA range for DMA).

More generally, I'd like to add the following combination table to the
spec, that shows the resulting reserved region type after merging regions
from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
bottom half.

      | N | R | M | I
    ------------------
    N | N | R | M | ?
    ------------------
    R |   | R | R | I
    ------------------
    M |   |   | M | I
    ------------------
    I |   |   |   | I

The 'I' column is problematic. If one endpoint has an IDENTITY region and
the other doesn't have anything at that virtual address, then making that
region an identity mapping will result in the second endpoint being able
to access firmware memory. If using nested translation, stage-2 can help
us here. But in general we should allow the device to reject an attach
that would result in a N+I combination if the host considers it too dodgy.
I think the other combinations are safe enough.

> what if the device is a physical assigned device. does the device report
> the host doorbell or the guest doorbell, may be worth to clarify.

Indeed. For a physical assigned device it is the guest doorbell as it
corresponds to the virtual ITS or APIC (though with the APIC it's the same
region).

> 3.1.1 last paragraph
> you talk about device ID but this is a terminology only known by ARM I
> think. Actually it is introduced later in 3.1.2.

Indeed, I'll move the explanation up.

> 3.1.2
> About ACPI integration. Isn't IORT ARM specific?

IORT is currently edited by ARM, as is virtio-iommu. Their contents are
not ARM specific however, they describe software interfaces that solve
mostly generic problems, and aim to be open.

IORT describes the relation between endpoints and their translation
agents. This problem has already been solved by IVRS, DMAR and IORT. From
a technical point of view the latter is more flexible and can host a new
translation agent. It should be relatively easy to use for virtio-iommu in
the kernel, and therefore I don't think we should introduce a fourth type
of table just for virtio-iommu.

> Will other architectures use that table?

Yes, in particular the IORT node would be used on virtual x86 platforms.
On ARM platforms, Linux guests use existing device-tree bindings.
Operating systems that don't implement DT could use IORT as well.

> In IORT we also have Named Component node
> which look pretty the same. Could you elaborate of the difference?

It's the same mechanism, except that the named component node is supposed
to be a leaf in the topology, not a translation agent. Using the same node
type could make it difficult for the driver to understand the table. But
see below.

> Device Object name: isn't it the path to the LNR0005 in the namespace?

It is. However, I wrote the example table to give a rough idea, but didn't
investigate the feasibility. I now think it would be easier to not present
the virtio-iommu as a LNRO0005 to the guest, but solely as the IORT node.
Instead of a namespace path, the node would directly describe the _CRS
content, which should be enough for the kernel to instantiate the device.

I'll prototype something to understand the problem better and find which
is easier to do. I'm busy preparing for conferences, so it might be two or
three weeks.


Unrelated remarks:

* When renaming "devices" to "endpoints" in v0.3 (to avoid confusion wrt
"The device", which is virtio-iommu dev), I didn't update the
attach/detach/probe requests, they still have 'device' fields. Since I'm
already butchering the API by renaming address_space to domain, should I
also rename 'device' to 'endpoint' in the structure? It doesn't matter as
much, since the difference here is mostly stylistic. But I think I will
get it out of the way for v0.5, to make things more clear overall.

* Regarding my question in v0.4 about adding a 'size' field to all
requests. I don't think we should do that. The problem was for parsing
interleaved RO/WO/RO buffers, but we might never hit the problem if we
design virtio-iommu properly. Framing should be left to the transport
protocol and it'd be mostly redundant in virtio-iommu requests.

Thanks,
Jean

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-28  7:39     ` [virtio-dev] " Tian, Kevin
@ 2017-09-06 11:54       ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:54 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here comes some comments:
> 
> 1.1 Motivation
> 
> You describe I/O page faults handling as future work. Seems you considered
> only recoverable fault (since "aka. PCI PRI" being used). What about other
> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
> 
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal IOMMU?
> usually reserved memory is reported to CPU through other method (e.g. e820
> on x86 platform). Of course MSI is a special case which is covered by BYPASS 
> and MSI flag... If yes, maybe you can also include an example in implementation 
> notes.

The RESV_MEM regions only describes IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest. Without a virtual IOMMU we can
either ignore them and hope everything will be fine, or remove all
reserved regions from the GPA space (which currently means editing by hand
the static guest-physical map...)

That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It describes
reserved IOVAs for a specific endpoint, and therefore removes the need to
carve the window out of the whole guest.

> Another thing I want to ask your opinion, about whether there is value of
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> memory ranges which may be DMA target and has to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with RMRR associated
> on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> propagated to the guest (since identity mapping also means reservation
> of virtual address space).

Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
for requesting the guest to map MSI windows when page-table handover is in
use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
mapping must be installed by the guest). But since a vSMMU would need a
solution as well, I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
> 
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave 
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
> regions pass through untranslated. In other words, the device SHOULD 
> handle such a region as if it was identity-mapped (virtual address equal to
> physical address). If the endpoint is not attached to any address space, 
> then the device MAY abort the transaction.
> --citation end
> 
> I have a question for the last sentence. From definition of BYPASS, it's
> orthogonal to whether there is an address space attached, then should
> we still allow "May abort" behavior? 

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
necessarily able to do DMA at all. The virtio-iommu device may setup DMA
mastering lazily, in which case any DMA transaction would abort, or have
setup DMA already, in which case the endpoint can access MEM_T_BYPASS regions.

Thanks!
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-28  7:39     ` [virtio-dev] " Tian, Kevin
  (?)
  (?)
@ 2017-09-06 11:54     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:54 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here comes some comments:
> 
> 1.1 Motivation
> 
> You describe I/O page faults handling as future work. Seems you considered
> only recoverable fault (since "aka. PCI PRI" being used). What about other
> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
> 
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal IOMMU?
> usually reserved memory is reported to CPU through other method (e.g. e820
> on x86 platform). Of course MSI is a special case which is covered by BYPASS 
> and MSI flag... If yes, maybe you can also include an example in implementation 
> notes.

The RESV_MEM regions only describes IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest. Without a virtual IOMMU we can
either ignore them and hope everything will be fine, or remove all
reserved regions from the GPA space (which currently means editing by hand
the static guest-physical map...)

That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It describes
reserved IOVAs for a specific endpoint, and therefore removes the need to
carve the window out of the whole guest.

> Another thing I want to ask your opinion, about whether there is value of
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> memory ranges which may be DMA target and has to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with RMRR associated
> on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> propagated to the guest (since identity mapping also means reservation
> of virtual address space).

Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
for requesting the guest to map MSI windows when page-table handover is in
use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
mapping must be installed by the guest). But since a vSMMU would need a
solution as well, I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
> 
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave 
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
> regions pass through untranslated. In other words, the device SHOULD 
> handle such a region as if it was identity-mapped (virtual address equal to
> physical address). If the endpoint is not attached to any address space, 
> then the device MAY abort the transaction.
> --citation end
> 
> I have a question for the last sentence. From definition of BYPASS, it's
> orthogonal to whether there is an address space attached, then should
> we still allow "May abort" behavior? 

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
necessarily able to do DMA at all. The virtio-iommu device may setup DMA
mastering lazily, in which case any DMA transaction would abort, or have
setup DMA already, in which case the endpoint can access MEM_T_BYPASS regions.

Thanks!
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-09-06 11:54       ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:54 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here comes some comments:
> 
> 1.1 Motivation
> 
> You describe I/O page faults handling as future work. Seems you considered
> only recoverable fault (since "aka. PCI PRI" being used). What about other
> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
> 
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal IOMMU?
> usually reserved memory is reported to CPU through other method (e.g. e820
> on x86 platform). Of course MSI is a special case which is covered by BYPASS 
> and MSI flag... If yes, maybe you can also include an example in implementation 
> notes.

The RESV_MEM regions only describes IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest. Without a virtual IOMMU we can
either ignore them and hope everything will be fine, or remove all
reserved regions from the GPA space (which currently means editing by hand
the static guest-physical map...)

That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It describes
reserved IOVAs for a specific endpoint, and therefore removes the need to
carve the window out of the whole guest.

> Another thing I want to ask your opinion, about whether there is value of
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> memory ranges which may be DMA target and has to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with RMRR associated
> on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> propagated to the guest (since identity mapping also means reservation
> of virtual address space).

Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
for requesting the guest to map MSI windows when page-table handover is in
use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
mapping must be installed by the guest). But since a vSMMU would need a
solution as well, I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
> 
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave 
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
> regions pass through untranslated. In other words, the device SHOULD 
> handle such a region as if it was identity-mapped (virtual address equal to
> physical address). If the endpoint is not attached to any address space, 
> then the device MAY abort the transaction.
> --citation end
> 
> I have a question for the last sentence. From definition of BYPASS, it's
> orthogonal to whether there is an address space attached, then should
> we still allow "May abort" behavior? 

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
necessarily able to do DMA at all. The virtio-iommu device may setup DMA
mastering lazily, in which case any DMA transaction would abort, or have
setup DMA already, in which case the endpoint can access MEM_T_BYPASS regions.

Thanks!
Jean

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-09-12 17:13   ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-09-12 17:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments:
At the moment struct virtio_iommu_req_probe flags is missing in your
header. As such I understood the ACK protocol was not implemented by the
driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
are the guest reserved regions.

First in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions. ie. this is
not the responsability of the virtio-iommu device.

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand this is the responsability of the virtio-iommu device
to gather information about the host resv regions in case of VFIO EP.
Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree.

I really think the spec should clarify what exact resv regions the
device should return in case of VFIO device and non VFIO device.

Thanks

Eric

> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (13 preceding siblings ...)
  (?)
@ 2017-09-12 17:13 ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-09-12 17:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments:
At the moment struct virtio_iommu_req_probe flags is missing in your
header. As such I understood the ACK protocol was not implemented by the
driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
are the guest reserved regions.

First in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions. ie. this is
not the responsability of the virtio-iommu device.

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand this is the responsability of the virtio-iommu device
to gather information about the host resv regions in case of VFIO EP.
Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree.

I really think the spec should clarify what exact resv regions the
device should return in case of VFIO device and non VFIO device.

Thanks

Eric

> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-09-12 17:13   ` Auger Eric
  0 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-09-12 17:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments:
At the moment struct virtio_iommu_req_probe flags is missing in your
header. As such I understood the ACK protocol was not implemented by the
driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
are the guest reserved regions.

First in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions. ie. this is
not the responsability of the virtio-iommu device.

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand this is the responsability of the virtio-iommu device
to gather information about the host resv regions in case of VFIO EP.
Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree.

I really think the spec should clarify what exact resv regions the
device should return in case of VFIO device and non VFIO device.

Thanks

Eric

> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13   ` [virtio-dev] " Auger Eric
@ 2017-09-13  3:47       ` Bharat Bhushan
  -1 siblings, 0 replies; 71+ messages in thread
From: Bharat Bhushan @ 2017-09-13  3:47 UTC (permalink / raw)
  To: Auger Eric, Jean-Philippe Brucker,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, marc.zyngier-5wv7dgnIgG8,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>;
> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b@public.gmane.org
> Cc: will.deacon-5wv7dgnIgG8@public.gmane.org; robin.murphy-5wv7dgnIgG8@public.gmane.org;
> lorenzo.pieralisi-5wv7dgnIgG8@public.gmane.org; mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> marc.zyngier-5wv7dgnIgG8@public.gmane.org; eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; Bharat Bhushan
> <bharat.bhushan-3arQi8VN3Tc@public.gmane.org>; peterx-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; kevin.tian-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time [1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It is
> >   split into a formal description of the virtio device, and implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise to first read introduction
> > and motivation, then skim through implementation notes and finally
> > look at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can simply
> > send a list of comments in relation to section numbers and a few words of
> context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to
> > compare more easily differences since the RFC (see [6]), but haven't
> > been made public so far. This is the first public posting since
> > initial proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow to handle MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is
> VIRTIO_IOMMU_T_MASK in your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is not
> the responsability of the virtio-iommu device.
> 
> Then why is it more the job of the device to return the guest iommu specific
> region rather than the driver itself?
> 
> Then I understand this is the responsability of the virtio-iommu device to
> gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> I really think the spec should clarify what exact resv regions the device should
> return in case of VFIO device and non VFIO device.

My understanding is that reserved regions are 1) Host reserved regions ( This is applicable for VFIO) 2) Guest MSI reserved region (This is applicable for emulated devices as well)
Does this make sense to have reserved regions per address-space/domain?

Thanks
-Bharat

> 
> Thanks
> 
> Eric
> 
> >
> > ## Future ##
> >
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> >
> > * Then on top of that, a major improvement will describe hardware
> >   acceleration features available to the guest. There is what I call "Page
> >   Table Handover" (or simply, from the host POV, "Nested"), the ability
> >   for the guest to manipulate its own page tables instead of sending
> >   MAP/UNMAP requests to the host. This, along with IO Page Fault
> >   reporting, will also permit SVM virtualization on different platforms.
> >
> > Thanks,
> > Jean
> >
> > [1] http://www.spinics.net/lists/kvm/msg147990.html
> > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >     I reiterate the disclaimers: don't use this document as a reference,
> >     it's a draft. It's also not an OASIS document yet. It may be riddled
> >     with mistakes. As this is a working draft, it is unstable and I do not
> >     guarantee backward compatibility of future versions.
> > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >     Warning: UAPI headers have changed! They didn't follow the spec,
> >     please update. (Use branch v0.1, that has the old headers, for the
> >     Qemu prototype [3])
> > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >     --viommu virtio[,opts] to instantiate a device.
> > [6]
> > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> >

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13   ` [virtio-dev] " Auger Eric
  (?)
@ 2017-09-13  3:47   ` Bharat Bhushan
  -1 siblings, 0 replies; 71+ messages in thread
From: Bharat Bhushan @ 2017-09-13  3:47 UTC (permalink / raw)
  To: Auger Eric, Jean-Philippe Brucker, iommu, kvm, virtualization,
	virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org
> Cc: will.deacon@arm.com; robin.murphy@arm.com;
> lorenzo.pieralisi@arm.com; mst@redhat.com; jasowang@redhat.com;
> marc.zyngier@arm.com; eric.auger.pro@gmail.com; Bharat Bhushan
> <bharat.bhushan@nxp.com>; peterx@redhat.com; kevin.tian@intel.com
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time [1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It is
> >   split into a formal description of the virtio device, and implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise to first read introduction
> > and motivation, then skim through implementation notes and finally
> > look at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can simply
> > send a list of comments in relation to section numbers and a few words of
> context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to
> > compare more easily differences since the RFC (see [6]), but haven't
> > been made public so far. This is the first public posting since
> > initial proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow to handle MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is
> VIRTIO_IOMMU_T_MASK in your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is not
> the responsability of the virtio-iommu device.
> 
> Then why is it more the job of the device to return the guest iommu specific
> region rather than the driver itself?
> 
> Then I understand this is the responsability of the virtio-iommu device to
> gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> I really think the spec should clarify what exact resv regions the device should
> return in case of VFIO device and non VFIO device.

My understanding is that reserved regions are 1) Host reserved regions ( This is applicable for VFIO) 2) Guest MSI reserved region (This is applicable for emulated devices as well)
Does this make sense to have reserved regions per address-space/domain?

Thanks
-Bharat

> 
> Thanks
> 
> Eric
> 
> >
> > ## Future ##
> >
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> >
> > * Then on top of that, a major improvement will describe hardware
> >   acceleration features available to the guest. There is what I call "Page
> >   Table Handover" (or simply, from the host POV, "Nested"), the ability
> >   for the guest to manipulate its own page tables instead of sending
> >   MAP/UNMAP requests to the host. This, along with IO Page Fault
> >   reporting, will also permit SVM virtualization on different platforms.
> >
> > Thanks,
> > Jean
> >
> > [1] http://www.spinics.net/lists/kvm/msg147990.html
> > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >     I reiterate the disclaimers: don't use this document as a reference,
> >     it's a draft. It's also not an OASIS document yet. It may be riddled
> >     with mistakes. As this is a working draft, it is unstable and I do not
> >     guarantee backward compatibility of future versions.
> > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >     Warning: UAPI headers have changed! They didn't follow the spec,
> >     please update. (Use branch v0.1, that has the old headers, for the
> >     Qemu prototype [3])
> > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >     --viommu virtio[,opts] to instantiate a device.
> > [6]
> > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> >

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] RE: [RFC] virtio-iommu version 0.4
@ 2017-09-13  3:47       ` Bharat Bhushan
  0 siblings, 0 replies; 71+ messages in thread
From: Bharat Bhushan @ 2017-09-13  3:47 UTC (permalink / raw)
  To: Auger Eric, Jean-Philippe Brucker, iommu, kvm, virtualization,
	virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, peterx, kevin.tian

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org
> Cc: will.deacon@arm.com; robin.murphy@arm.com;
> lorenzo.pieralisi@arm.com; mst@redhat.com; jasowang@redhat.com;
> marc.zyngier@arm.com; eric.auger.pro@gmail.com; Bharat Bhushan
> <bharat.bhushan@nxp.com>; peterx@redhat.com; kevin.tian@intel.com
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time [1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It is
> >   split into a formal description of the virtio device, and implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise to first read introduction
> > and motivation, then skim through implementation notes and finally
> > look at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can simply
> > send a list of comments in relation to section numbers and a few words of
> context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to
> > compare more easily differences since the RFC (see [6]), but haven't
> > been made public so far. This is the first public posting since
> > initial proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow to handle MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is
> VIRTIO_IOMMU_T_MASK in your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is not
> the responsability of the virtio-iommu device.
> 
> Then why is it more the job of the device to return the guest iommu specific
> region rather than the driver itself?
> 
> Then I understand this is the responsability of the virtio-iommu device to
> gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> I really think the spec should clarify what exact resv regions the device should
> return in case of VFIO device and non VFIO device.

My understanding is that reserved regions are 1) Host reserved regions ( This is applicable for VFIO) 2) Guest MSI reserved region (This is applicable for emulated devices as well)
Does this make sense to have reserved regions per address-space/domain?

Thanks
-Bharat

> 
> Thanks
> 
> Eric
> 
> >
> > ## Future ##
> >
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> >
> > * Then on top of that, a major improvement will describe hardware
> >   acceleration features available to the guest. There is what I call "Page
> >   Table Handover" (or simply, from the host POV, "Nested"), the ability
> >   for the guest to manipulate its own page tables instead of sending
> >   MAP/UNMAP requests to the host. This, along with IO Page Fault
> >   reporting, will also permit SVM virtualization on different platforms.
> >
> > Thanks,
> > Jean
> >
> > [1] http://www.spinics.net/lists/kvm/msg147990.html
> > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >     I reiterate the disclaimers: don't use this document as a reference,
> >     it's a draft. It's also not an OASIS document yet. It may be riddled
> >     with mistakes. As this is a working draft, it is unstable and I do not
> >     guarantee backward compatibility of future versions.
> > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >     Warning: UAPI headers have changed! They didn't follow the spec,
> >     please update. (Use branch v0.1, that has the old headers, for the
> >     Qemu prototype [3])
> > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >     --viommu virtio[,opts] to instantiate a device.
> > [6]
> > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13   ` [virtio-dev] " Auger Eric
@ 2017-09-19 10:47     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-19 10:47 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.

Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.

> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.

Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).

> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
>
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is
> not the responsability of the virtio-iommu device.

For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.

> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?

The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.

> Then I understand this is the responsability of the virtio-iommu device
> to gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.

Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.

> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13   ` [virtio-dev] " Auger Eric
                     ` (3 preceding siblings ...)
  (?)
@ 2017-09-19 10:47   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-19 10:47 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.

Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.

> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.

Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).

> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
>
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is
> not the responsability of the virtio-iommu device.

For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.

> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?

The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.

> Then I understand this is the responsability of the virtio-iommu device
> to gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.

Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.

> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-09-19 10:47     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-19 10:47 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.

Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.

> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.

Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).

> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
>
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is
> not the responsability of the virtio-iommu device.

For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.

> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?

The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.

> Then I understand this is the responsability of the virtio-iommu device
> to gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.

Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.

> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-19 10:47     ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-09-20  9:37       ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-09-20  9:37 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>> are the guest reserved regions.
>>
>> First in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions ie. this is
>> not the responsability of the virtio-iommu device.
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry I was talking about *non* IOMMU specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver and I didn't that that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target specific code in the virtio-iommu device
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand this is the responsability of the virtio-iommu device
>> to gather information about the host resv regions in case of VFIO EP.
>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize if the probe request is sent to an emulated device, we
should return the target specific MSI window. We can't and don't return
the non IOMMU specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in case of VFIO device and non VFIO device.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy as both
the driver and the device need to parse the req_probe buffer. The device
need to fill in the output buffer and then read the same info on the
input buffer. Couldn't we imagine something simpler?

Thanks

Eric
> 
> Thanks,
> Jean
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-19 10:47     ` [virtio-dev] " Jean-Philippe Brucker
  (?)
@ 2017-09-20  9:37     ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-09-20  9:37 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>> are the guest reserved regions.
>>
>> First in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions ie. this is
>> not the responsability of the virtio-iommu device.
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry I was talking about *non* IOMMU specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver and I didn't that that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target specific code in the virtio-iommu device
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand this is the responsability of the virtio-iommu device
>> to gather information about the host resv regions in case of VFIO EP.
>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize if the probe request is sent to an emulated device, we
should return the target specific MSI window. We can't and don't return
the non IOMMU specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in case of VFIO device and non VFIO device.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy as both
the driver and the device need to parse the req_probe buffer. The device
need to fill in the output buffer and then read the same info on the
input buffer. Couldn't we imagine something simpler?

Thanks

Eric
> 
> Thanks,
> Jean
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-09-20  9:37       ` Auger Eric
  0 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-09-20  9:37 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>> are the guest reserved regions.
>>
>> First in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions ie. this is
>> not the responsability of the virtio-iommu device.
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry I was talking about *non* IOMMU specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver and I didn't that that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target specific code in the virtio-iommu device
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand this is the responsability of the virtio-iommu device
>> to gather information about the host resv regions in case of VFIO EP.
>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize if the probe request is sent to an emulated device, we
should return the target specific MSI window. We can't and don't return
the non IOMMU specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in case of VFIO device and non VFIO device.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy as both
the driver and the device need to parse the req_probe buffer. The device
need to fill in the output buffer and then read the same info on the
input buffer. Couldn't we imagine something simpler?

Thanks

Eric
> 
> Thanks,
> Jean
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-06 11:54       ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-09-21  6:27         ` Tian, Kevin
  -1 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page faults handling as future work. Seems you
> considered
> > only recoverable fault (since "aka. PCI PRI" being used). What about other
> > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

what about adding a placeholder for now? Though same mechanism
can be reused, it's an essential part to make virtio-iommu architecture
complete even before talking support for recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear when
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare metal
> IOMMU?
> > usually reserved memory is reported to CPU through other method (e.g.
> e820
> > on x86 platform). Of course MSI is a special case which is covered by
> BYPASS
> > and MSI flag... If yes, maybe you can also include an example in
> implementation
> > notes.
> 
> The RESV_MEM regions only describes IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent
> DMA
> addresses that shouldn't be accessed by endpoints because they won't
> reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance.
> With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing by
> hand
> the static guest-physical map...)
> 
> That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It
> describes
> reserved IOVAs for a specific endpoint, and therefore removes the need to
> carve the window out of the whole guest.

Understand and thanks for elaboration.

> 
> > Another thing I want to ask your opinion, about whether there is value of
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to Reserved Memory Region Reporting
> > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> > memory ranges which may be DMA target and has to be identity mapped
> > when DMA remapping is enabled. I'm not sure whether ARM has similar
> > capability and whether there might be a general usage beyond VT-d. For
> > now the only usage in my mind is to assign a device with RMRR associated
> > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> > propagated to the guest (since identity mapping also means reservation
> > of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

VTd spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for GPU happens in legacy VGA usage in early
boot time before actual graphics driver is loaded)

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think you really meant
"host iommu driver can create identity mapping for assigned device",
correct?

Then yes, I think above works.

> table
> control over to the guest, the host won't be in control of IOVA->GPA
> mappings and will need to gracefully ask the guest to do it.
> 
> I'm not aware of any firmware description resembling Intel RMRR or AMD
> IVMD on ARM platforms. I do think ARM platforms could need
> MEM_T_IDENTITY
> for requesting the guest to map MSI windows when page-table handover is
> in
> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
> mapping must be installed by the guest). But since a vSMMU would need a
> solution as well, I think I'll try to implement something more generic.

curious do you need identity mapping full IOVA->GPA->HPA translation, 
or just in GPA->HPA stage sufficient for above MSI scenario?

> 
> > 2.6.8.2.3 Device Requirements: Property RESV_MEM
> >
> > --citation start--
> > If an endpoint is attached to an address space, the device SHOULD leave
> > any access targeting one of its
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
> > regions pass through untranslated. In other words, the device SHOULD
> > handle such a region as if it was identity-mapped (virtual address equal to
> > physical address). If the endpoint is not attached to any address space,
> > then the device MAY abort the transaction.
> > --citation end
> >
> > I have a question for the last sentence. From definition of BYPASS, it's
> > orthogonal to whether there is an address space attached, then should
> > we still allow "May abort" behavior?
> 
> The behavior is left as an implementation choice, and I'm not sure it's
> worth enforcing in the architecture. If the endpoint isn't attached to any
> domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
> necessarily able to do DMA at all. The virtio-iommu device may setup DMA
> mastering lazily, in which case any DMA transaction would abort, or have
> setup DMA already, in which case the endpoint can access MEM_T_BYPASS
> regions.
> 

fair enough. thanks

Kevin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-06 11:54       ` [virtio-dev] " Jean-Philippe Brucker
  (?)
@ 2017-09-21  6:27       ` Tian, Kevin
  -1 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page faults handling as future work. Seems you
> considered
> > only recoverable fault (since "aka. PCI PRI" being used). What about other
> > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

what about adding a placeholder for now? Though same mechanism
can be reused, it's an essential part to make virtio-iommu architecture
complete even before talking support for recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear when
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare metal
> IOMMU?
> > usually reserved memory is reported to CPU through other method (e.g.
> e820
> > on x86 platform). Of course MSI is a special case which is covered by
> BYPASS
> > and MSI flag... If yes, maybe you can also include an example in
> implementation
> > notes.
> 
> The RESV_MEM regions only describes IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent
> DMA
> addresses that shouldn't be accessed by endpoints because they won't
> reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance.
> With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing by
> hand
> the static guest-physical map...)
> 
> That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It
> describes
> reserved IOVAs for a specific endpoint, and therefore removes the need to
> carve the window out of the whole guest.

Understand and thanks for elaboration.

> 
> > Another thing I want to ask your opinion, about whether there is value of
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to Reserved Memory Region Reporting
> > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> > memory ranges which may be DMA target and has to be identity mapped
> > when DMA remapping is enabled. I'm not sure whether ARM has similar
> > capability and whether there might be a general usage beyond VT-d. For
> > now the only usage in my mind is to assign a device with RMRR associated
> > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> > propagated to the guest (since identity mapping also means reservation
> > of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

VTd spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for GPU happens in legacy VGA usage in early
boot time before actual graphics driver is loaded)

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think you really meant
"host iommu driver can create identity mapping for assigned device",
correct?

Then yes, I think above works.

> table
> control over to the guest, the host won't be in control of IOVA->GPA
> mappings and will need to gracefully ask the guest to do it.
> 
> I'm not aware of any firmware description resembling Intel RMRR or AMD
> IVMD on ARM platforms. I do think ARM platforms could need
> MEM_T_IDENTITY
> for requesting the guest to map MSI windows when page-table handover is
> in
> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
> mapping must be installed by the guest). But since a vSMMU would need a
> solution as well, I think I'll try to implement something more generic.

curious do you need identity mapping full IOVA->GPA->HPA translation, 
or just in GPA->HPA stage sufficient for above MSI scenario?

> 
> > 2.6.8.2.3 Device Requirements: Property RESV_MEM
> >
> > --citation start--
> > If an endpoint is attached to an address space, the device SHOULD leave
> > any access targeting one of its
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
> > regions pass through untranslated. In other words, the device SHOULD
> > handle such a region as if it was identity-mapped (virtual address equal to
> > physical address). If the endpoint is not attached to any address space,
> > then the device MAY abort the transaction.
> > --citation end
> >
> > I have a question for the last sentence. From definition of BYPASS, it's
> > orthogonal to whether there is an address space attached, then should
> > we still allow "May abort" behavior?
> 
> The behavior is left as an implementation choice, and I'm not sure it's
> worth enforcing in the architecture. If the endpoint isn't attached to any
> domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
> necessarily able to do DMA at all. The virtio-iommu device may setup DMA
> mastering lazily, in which case any DMA transaction would abort, or have
> setup DMA already, in which case the endpoint can access MEM_T_BYPASS
> regions.
> 

fair enough. thanks

Kevin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] RE: [RFC] virtio-iommu version 0.4
@ 2017-09-21  6:27         ` Tian, Kevin
  0 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page faults handling as future work. Seems you
> considered
> > only recoverable fault (since "aka. PCI PRI" being used). What about other
> > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

what about adding a placeholder for now? Though same mechanism
can be reused, it's an essential part to make virtio-iommu architecture
complete even before talking support for recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear when
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare metal
> IOMMU?
> > usually reserved memory is reported to CPU through other method (e.g.
> e820
> > on x86 platform). Of course MSI is a special case which is covered by
> BYPASS
> > and MSI flag... If yes, maybe you can also include an example in
> implementation
> > notes.
> 
> The RESV_MEM regions only describes IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent
> DMA
> addresses that shouldn't be accessed by endpoints because they won't
> reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance.
> With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing by
> hand
> the static guest-physical map...)
> 
> That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It
> describes
> reserved IOVAs for a specific endpoint, and therefore removes the need to
> carve the window out of the whole guest.

Understand and thanks for elaboration.

> 
> > Another thing I want to ask your opinion, about whether there is value of
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to Reserved Memory Region Reporting
> > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> > memory ranges which may be DMA target and has to be identity mapped
> > when DMA remapping is enabled. I'm not sure whether ARM has similar
> > capability and whether there might be a general usage beyond VT-d. For
> > now the only usage in my mind is to assign a device with RMRR associated
> > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> > propagated to the guest (since identity mapping also means reservation
> > of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

VTd spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for GPU happens in legacy VGA usage in early
boot time before actual graphics driver is loaded)

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think you really meant
"host iommu driver can create identity mapping for assigned device",
correct?

Then yes, I think above works.

> table
> control over to the guest, the host won't be in control of IOVA->GPA
> mappings and will need to gracefully ask the guest to do it.
> 
> I'm not aware of any firmware description resembling Intel RMRR or AMD
> IVMD on ARM platforms. I do think ARM platforms could need
> MEM_T_IDENTITY
> for requesting the guest to map MSI windows when page-table handover is
> in
> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
> mapping must be installed by the guest). But since a vSMMU would need a
> solution as well, I think I'll try to implement something more generic.

curious do you need identity mapping full IOVA->GPA->HPA translation, 
or just in GPA->HPA stage sufficient for above MSI scenario?

> 
> > 2.6.8.2.3 Device Requirements: Property RESV_MEM
> >
> > --citation start--
> > If an endpoint is attached to an address space, the device SHOULD leave
> > any access targeting one of its
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
> > regions pass through untranslated. In other words, the device SHOULD
> > handle such a region as if it was identity-mapped (virtual address equal to
> > physical address). If the endpoint is not attached to any address space,
> > then the device MAY abort the transaction.
> > --citation end
> >
> > I have a question for the last sentence. From definition of BYPASS, it's
> > orthogonal to whether there is an address space attached, then should
> > we still allow "May abort" behavior?
> 
> The behavior is left as an implementation choice, and I'm not sure it's
> worth enforcing in the architecture. If the endpoint isn't attached to any
> domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
> necessarily able to do DMA at all. The virtio-iommu device may setup DMA
> mastering lazily, in which case any DMA transaction would abort, or have
> setup DMA already, in which case the endpoint can access MEM_T_BYPASS
> regions.
> 

fair enough. thanks

Kevin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-06 11:48       ` Jean-Philippe Brucker
@ 2017-09-21  6:41         ` Tian, Kevin
  -1 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric, iommu, kvm, virtualization,
	virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Device
> > requirement? if same types I assume?
> 
> Combination rules apply to device and driver. When the driver puts
> multiple endpoints in the same domain, combination rules apply. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if a device has some virtual address RESV_T_MSI and an other
> device has the same virtual address RESV_T_IDENTITY, what should the
> driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From DMA mappings point of view, it is
> effectively the same as RESV_T_RESERVED. When merging
> RESV_T_RESERVED and
> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
> required
> for one endpoint to work (the one with IDENTITY) and is not accessed by
> the other (the driver must not allocate this IOVA range for DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, that shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
> bottom half.
> 
>       | N | R | M | I
>     ------------------
>     N | N | R | M | ?
>     ------------------
>     R |   | R | R | I
>     ------------------
>     M |   |   | M | I
>     ------------------
>     I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
> 

will overlap happen in real? Can we simplify the spec to have device
not reporting overlapping regions?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-09-06 11:48       ` Jean-Philippe Brucker
  (?)
@ 2017-09-21  6:41       ` Tian, Kevin
  -1 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric, iommu, kvm, virtualization,
	virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Device
> > requirement? if same types I assume?
> 
> Combination rules apply to device and driver. When the driver puts
> multiple endpoints in the same domain, combination rules apply. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if a device has some virtual address RESV_T_MSI and an other
> device has the same virtual address RESV_T_IDENTITY, what should the
> driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From DMA mappings point of view, it is
> effectively the same as RESV_T_RESERVED. When merging
> RESV_T_RESERVED and
> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
> required
> for one endpoint to work (the one with IDENTITY) and is not accessed by
> the other (the driver must not allocate this IOVA range for DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, that shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
> bottom half.
> 
>       | N | R | M | I
>     ------------------
>     N | N | R | M | ?
>     ------------------
>     R |   | R | R | I
>     ------------------
>     M |   |   | M | I
>     ------------------
>     I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
> 

will overlap happen in real? Can we simplify the spec to have device
not reporting overlapping regions?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-09-21  6:41         ` Tian, Kevin
  0 siblings, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric, iommu, kvm, virtualization,
	virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Device
> > requirement? if same types I assume?
> 
> Combination rules apply to device and driver. When the driver puts
> multiple endpoints in the same domain, combination rules apply. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if a device has some virtual address RESV_T_MSI and an other
> device has the same virtual address RESV_T_IDENTITY, what should the
> driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From DMA mappings point of view, it is
> effectively the same as RESV_T_RESERVED. When merging
> RESV_T_RESERVED and
> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
> required
> for one endpoint to work (the one with IDENTITY) and is not accessed by
> the other (the driver must not allocate this IOVA range for DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, that shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
> bottom half.
> 
>       | N | R | M | I
>     ------------------
>     N | N | R | M | ?
>     ------------------
>     R |   | R | R | I
>     ------------------
>     M |   |   | M | I
>     ------------------
>     I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
> 

will overlap happen in real? Can we simplify the spec to have device
not reporting overlapping regions?

Thanks
Kevin

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-20  9:37       ` [virtio-dev] " Auger Eric
@ 2017-09-21 10:29         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-21 10:29 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

On 20/09/17 10:37, Auger Eric wrote:
> Hi Jean,
> On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
>> Hi Eric,
>>
>> On 12/09/17 18:13, Auger Eric wrote:
>>> 2.6.7
>>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>>> At the moment struct virtio_iommu_req_probe flags is missing in your
>>> header. As such I understood the ACK protocol was not implemented by the
>>> driver in your branch.
>>
>> Uh indeed. And yet I could swear I've written that code... somewhere. I
>> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>>
>>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>>> header too.
>>
>> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
>> (though it is a mouthful).
>>
>>> 2.6.8.2:
>>> - I am really confused about what the device should report as resv
>>> regions depending on the PE nature (VFIO or not VFIO)
>>>
>>> In other iommu drivers, the resv regions are populated by the iommu
>>> driver through its get_resv_regions callback. They are usually composed
>>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>>> specific (device specific) reserved regions:
>>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>>> are the guest reserved regions.
>>>
>>> First in the current virtio-iommu driver I don't see the
>>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>>> driver should compute the non IOMMU specific MSI regions ie. this is
>>> not the responsability of the virtio-iommu device.
>>
>> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
>> mapping the MSI doorbell. But the driver has to know whether the doorbell
>> region is translated or bypassed.
> Sorry I was talking about *non* IOMMU specific MSI regions, typically
> the regions corresponding to guest PCI host bridge windows. This is
> normally computed in the iommu driver and I didn't that that in the
> existing virtio-iommu driver.

Ah right, I don't think the virtio-iommu device has to report the windows
of the emulated host bridge. RESV is useful for things that are not
obvious to the guest, for instance the physical PCI bridges that are
hidden from the guest.

It's an interesting point though, I can imagine non-Linux guest that are
not as well-equipped with dealing with things like PCI bridge windows, and
could benefit from the device reporting them.

In the end it is up to the device implementation to decide what regions to
report, and make sure that the guest is aware of the various traps in IOVA
space. For things like emulated bridges, the device can expect the guest
to find out about them. For physical bridges/SW_MSI of the host, it should
report the region and make sure that the guest doesn't map them. I'll add
a few more examples to the Implementation Notes, but I suspect reading
your Qemu source code will always be more helpful to people.

>>> Then why is it more the job of the device to return the guest iommu
>>> specific region rather than the driver itself?
>>
>> The MSI region is architectural on x86 IOMMUs, but
>> implementation-defined on virtio-iommu. It depends which platform the host
>> is emulating. In Linux, x86 IOMMU drivers register the bypass region
>> because there always is an IOAPIC on the other end, with a fixed MSI
>> address. But virtio-iommu may be either behind a GIC, an APIC or some
>> other IRQ chip.
>>
>> The driver *could* go over all the irqchips/platforms it knows and try to
>> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
>> them, but it would look horrible. I much prefer having a well-defined way
>> of doing this, so a description from the device.
> 
> This means I must have target specific code in the virtio-iommu device
> which is unusual, right? I was initially thinking you could handle that
> on the driver side using a config set for ARM|ARM64. But on the other
> hand you should communicate the info to the device ...

But the device has to know that it has a region that DMA transactions
bypass, right? If you want to implement MSI bypass, then you already have
to add a special case to your device code, and reporting it in a probe
shouldn't require a lot more work. For example, amdvi_translate() has a
special case for amdvi_is_interrupt_addr().

>>> Then I understand this is the responsability of the virtio-iommu device
>>> to gather information about the host resv regions in case of VFIO EP.
>>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
> So to summarize if the probe request is sent to an emulated device, we
> should return the target specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
> 
> For a VFIO device, we would return all reserved regions of the group the
> device belongs to. Is that correct?

Yes

>>
>>> I really think the spec should clarify what exact resv regions the
>>> device should return in case of VFIO device and non VFIO device.
>>
>> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
>> example in Implementation Notes. Do you think the MSI examples at the end
>> need improvement as well? I can try to explain that RESV_MSI regions in
>> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
>> regions from the host.
> 
> I think I would expect an explanation detailing returned reserved
> regions for pure emulated devices and HW/VFIO devices.
> 
> Another unrelated remark:
> - you should add a permission violation error.

Do you mean reporting errors from device to driver? It's on the top of my
list, I might add it to v0.5 (without recoverable/PRI mode at the moment)

> - wrt the probe request ACK protocol, this looks pretty heavy as both
> the driver and the device need to parse the req_probe buffer. The device
> need to fill in the output buffer and then read the same info on the
> input buffer. Couldn't we imagine something simpler?

I can't, but I can get rid of the ack protocol and leave space if someone
needs to re-introduce it. I quite like it but don't think it'll be useful
to the few features I'm planning to add. A future re-introduction will
simply have to add a global feature bit VIRTIO_IOMMU_F_PROBE_ACK.

Note that the device isn't required to implement the ACK phase, we simply
give it a chance to verify that the driver understood its information. It
could simply check if the ACK bit is present and ignore the request in
this case.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-20  9:37       ` [virtio-dev] " Auger Eric
  (?)
@ 2017-09-21 10:29       ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-21 10:29 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

On 20/09/17 10:37, Auger Eric wrote:
> Hi Jean,
> On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
>> Hi Eric,
>>
>> On 12/09/17 18:13, Auger Eric wrote:
>>> 2.6.7
>>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>>> At the moment struct virtio_iommu_req_probe flags is missing in your
>>> header. As such I understood the ACK protocol was not implemented by the
>>> driver in your branch.
>>
>> Uh indeed. And yet I could swear I've written that code... somewhere. I
>> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>>
>>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>>> header too.
>>
>> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
>> (though it is a mouthful).
>>
>>> 2.6.8.2:
>>> - I am really confused about what the device should report as resv
>>> regions depending on the PE nature (VFIO or not VFIO)
>>>
>>> In other iommu drivers, the resv regions are populated by the iommu
>>> driver through its get_resv_regions callback. They are usually composed
>>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>>> specific (device specific) reserved regions:
>>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>>> are the guest reserved regions.
>>>
>>> First in the current virtio-iommu driver I don't see the
>>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>>> driver should compute the non IOMMU specific MSI regions ie. this is
>>> not the responsability of the virtio-iommu device.
>>
>> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
>> mapping the MSI doorbell. But the driver has to know whether the doorbell
>> region is translated or bypassed.
> Sorry I was talking about *non* IOMMU specific MSI regions, typically
> the regions corresponding to guest PCI host bridge windows. This is
> normally computed in the iommu driver and I didn't that that in the
> existing virtio-iommu driver.

Ah right, I don't think the virtio-iommu device has to report the windows
of the emulated host bridge. RESV is useful for things that are not
obvious to the guest, for instance the physical PCI bridges that are
hidden from the guest.

It's an interesting point though, I can imagine non-Linux guest that are
not as well-equipped with dealing with things like PCI bridge windows, and
could benefit from the device reporting them.

In the end it is up to the device implementation to decide what regions to
report, and make sure that the guest is aware of the various traps in IOVA
space. For things like emulated bridges, the device can expect the guest
to find out about them. For physical bridges/SW_MSI of the host, it should
report the region and make sure that the guest doesn't map them. I'll add
a few more examples to the Implementation Notes, but I suspect reading
your Qemu source code will always be more helpful to people.

>>> Then why is it more the job of the device to return the guest iommu
>>> specific region rather than the driver itself?
>>
>> The MSI region is architectural on x86 IOMMUs, but
>> implementation-defined on virtio-iommu. It depends which platform the host
>> is emulating. In Linux, x86 IOMMU drivers register the bypass region
>> because there always is an IOAPIC on the other end, with a fixed MSI
>> address. But virtio-iommu may be either behind a GIC, an APIC or some
>> other IRQ chip.
>>
>> The driver *could* go over all the irqchips/platforms it knows and try to
>> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
>> them, but it would look horrible. I much prefer having a well-defined way
>> of doing this, so a description from the device.
> 
> This means I must have target specific code in the virtio-iommu device
> which is unusual, right? I was initially thinking you could handle that
> on the driver side using a config set for ARM|ARM64. But on the other
> hand you should communicate the info to the device ...

But the device has to know that it has a region that DMA transactions
bypass, right? If you want to implement MSI bypass, then you already have
to add a special case to your device code, and reporting it in a probe
shouldn't require a lot more work. For example, amdvi_translate() has a
special case for amdvi_is_interrupt_addr().

>>> Then I understand this is the responsability of the virtio-iommu device
>>> to gather information about the host resv regions in case of VFIO EP.
>>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
> So to summarize if the probe request is sent to an emulated device, we
> should return the target specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
> 
> For a VFIO device, we would return all reserved regions of the group the
> device belongs to. Is that correct?

Yes

>>
>>> I really think the spec should clarify what exact resv regions the
>>> device should return in case of VFIO device and non VFIO device.
>>
>> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
>> example in Implementation Notes. Do you think the MSI examples at the end
>> need improvement as well? I can try to explain that RESV_MSI regions in
>> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
>> regions from the host.
> 
> I think I would expect an explanation detailing returned reserved
> regions for pure emulated devices and HW/VFIO devices.
> 
> Another unrelated remark:
> - you should add a permission violation error.

Do you mean reporting errors from device to driver? It's on the top of my
list, I might add it to v0.5 (without recoverable/PRI mode at the moment)

> - wrt the probe request ACK protocol, this looks pretty heavy as both
> the driver and the device need to parse the req_probe buffer. The device
> need to fill in the output buffer and then read the same info on the
> input buffer. Couldn't we imagine something simpler?

I can't, but I can get rid of the ack protocol and leave space if someone
needs to re-introduce it. I quite like it but don't think it'll be useful
to the few features I'm planning to add. A future re-introduction will
simply have to add a global feature bit VIRTIO_IOMMU_F_PROBE_ACK.

Note that the device isn't required to implement the ACK phase, we simply
give it a chance to verify that the driver understood its information. It
could simply check if the ACK bit is present and ignore the request in
this case.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [virtio-dev] Re: [RFC] virtio-iommu version 0.4
@ 2017-09-21 10:29         ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-21 10:29 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

On 20/09/17 10:37, Auger Eric wrote:
> Hi Jean,
> On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
>> Hi Eric,
>>
>> On 12/09/17 18:13, Auger Eric wrote:
>>> 2.6.7
>>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>>> At the moment struct virtio_iommu_req_probe flags is missing in your
>>> header. As such I understood the ACK protocol was not implemented by the
>>> driver in your branch.
>>
>> Uh indeed. And yet I could swear I've written that code... somewhere. I
>> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>>
>>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>>> header too.
>>
>> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
>> (though it is a mouthful).
>>
>>> 2.6.8.2:
>>> - I am really confused about what the device should report as resv
>>> regions depending on the PE nature (VFIO or not VFIO)
>>>
>>> In other iommu drivers, the resv regions are populated by the iommu
>>> driver through its get_resv_regions callback. They are usually composed
>>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>>> specific (device specific) reserved regions:
>>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>>> are the guest reserved regions.
>>>
>>> First in the current virtio-iommu driver I don't see the
>>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>>> driver should compute the non IOMMU specific MSI regions ie. this is
>>> not the responsability of the virtio-iommu device.
>>
>> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
>> mapping the MSI doorbell. But the driver has to know whether the doorbell
>> region is translated or bypassed.
> Sorry I was talking about *non* IOMMU specific MSI regions, typically
> the regions corresponding to guest PCI host bridge windows. This is
> normally computed in the iommu driver and I didn't that that in the
> existing virtio-iommu driver.

Ah right, I don't think the virtio-iommu device has to report the windows
of the emulated host bridge. RESV is useful for things that are not
obvious to the guest, for instance the physical PCI bridges that are
hidden from the guest.

It's an interesting point though, I can imagine non-Linux guest that are
not as well-equipped with dealing with things like PCI bridge windows, and
could benefit from the device reporting them.

In the end it is up to the device implementation to decide what regions to
report, and make sure that the guest is aware of the various traps in IOVA
space. For things like emulated bridges, the device can expect the guest
to find out about them. For physical bridges/SW_MSI of the host, it should
report the region and make sure that the guest doesn't map them. I'll add
a few more examples to the Implementation Notes, but I suspect reading
your Qemu source code will always be more helpful to people.

>>> Then why is it more the job of the device to return the guest iommu
>>> specific region rather than the driver itself?
>>
>> The MSI region is architectural on x86 IOMMUs, but
>> implementation-defined on virtio-iommu. It depends which platform the host
>> is emulating. In Linux, x86 IOMMU drivers register the bypass region
>> because there always is an IOAPIC on the other end, with a fixed MSI
>> address. But virtio-iommu may be either behind a GIC, an APIC or some
>> other IRQ chip.
>>
>> The driver *could* go over all the irqchips/platforms it knows and try to
>> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
>> them, but it would look horrible. I much prefer having a well-defined way
>> of doing this, so a description from the device.
> 
> This means I must have target specific code in the virtio-iommu device
> which is unusual, right? I was initially thinking you could handle that
> on the driver side using a config set for ARM|ARM64. But on the other
> hand you should communicate the info to the device ...

But the device has to know that it has a region that DMA transactions
bypass, right? If you want to implement MSI bypass, then you already have
to add a special case to your device code, and reporting it in a probe
shouldn't require a lot more work. For example, amdvi_translate() has a
special case for amdvi_is_interrupt_addr().

>>> Then I understand this is the responsability of the virtio-iommu device
>>> to gather information about the host resv regions in case of VFIO EP.
>>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
> So to summarize if the probe request is sent to an emulated device, we
> should return the target specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
> 
> For a VFIO device, we would return all reserved regions of the group the
> device belongs to. Is that correct?

Yes

>>
>>> I really think the spec should clarify what exact resv regions the
>>> device should return in case of VFIO device and non VFIO device.
>>
>> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
>> example in Implementation Notes. Do you think the MSI examples at the end
>> need improvement as well? I can try to explain that RESV_MSI regions in
>> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
>> regions from the host.
> 
> I think I would expect an explanation detailing returned reserved
> regions for pure emulated devices and HW/VFIO devices.
> 
> Another unrelated remark:
> - you should add a permission violation error.

Do you mean reporting errors from device to driver? It's on the top of my
list, I might add it to v0.5 (without recoverable/PRI mode at the moment)

> - wrt the probe request ACK protocol, this looks pretty heavy as both
> the driver and the device need to parse the req_probe buffer. The device
> need to fill in the output buffer and then read the same info on the
> input buffer. Couldn't we imagine something simpler?

I can't, but I can get rid of the ack protocol and leave space if someone
needs to re-introduce it. I quite like it but don't think it'll be useful
to the few features I'm planning to add. A future re-introduction will
simply have to add a global feature bit VIRTIO_IOMMU_F_PROBE_ACK.

Note that the device isn't required to implement the ACK phase, we simply
give it a chance to verify that the driver understood its information. It
could simply check if the ACK bit is present and ignore the request in
this case.

Thanks,
Jean

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] RE: [RFC] virtio-iommu version 0.4
  2017-09-21  6:27         ` [virtio-dev] " Tian, Kevin
@ 2017-09-25 13:32           ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-25 13:32 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

On 21/09/17 07:27, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Wednesday, September 6, 2017 7:55 PM
>>
>> Hi Kevin,
>>
>> On 28/08/17 08:39, Tian, Kevin wrote:
>>> Here comes some comments:
>>>
>>> 1.1 Motivation
>>>
>>> You describe I/O page faults handling as future work. Seems you
>> considered
>>> only recoverable fault (since "aka. PCI PRI" being used). What about other
>>> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
>>> a valid mapping? Even when there is no PRI support, we need some basic
>>> form of fault reporting mechanism to indicate such errors to guest.
>>
>> I am considering recoverable faults as the end goal, but reporting
>> unrecoverable faults should use the same queue, with slightly different
>> fields and no need for the driver to reply to the device.
> 
> what about adding a placeholder for now? Though same mechanism
> can be reused, it's an essential part to make virtio-iommu architecture
> complete even before talking support for recoverable faults. :-)

I'll see if I can come up with something simple for v0.5, but it seems
like a big chunk of work. I don't really know what to report to the guest
at the moment. I don't want to report vendor-specific details about the
fault, but it should still be useful content, to let the guest decide
whether they need to reset/kill the device or just print something

[...]
>> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
>> used for both iGPU and USB controllers on my x86 machines. Do you know
>> more precisely what they are used for by the firmware?
> 
> VTd spec has a clear description:
> 
> 3.14 Handling Requests to Reserved System Memory
> Reserved system memory regions are typically allocated by BIOS at boot 
> time and reported to OS as reserved address ranges in the system memory 
> map. Requests-without-PASID to these reserved regions may either occur 
> as a result of operations performed by the system software driver (for 
> example in the case of DMA from unified memory access (UMA) graphics 
> controllers to graphics reserved memory), or may be initiated by non 
> system software (for example in case of DMA performed by a USB 
> controller under BIOS SMM control for legacy keyboard emulation). 
> For proper functioning of these legacy reserved memory usages, when 
> system software enables DMA remapping, the second-level translation 
> structures for the respective devices are expected to be set up to provide
> identity mapping for the specified reserved memory regions with read 
> and write permissions.
> 
> (one specific example for GPU happens in legacy VGA usage in early
> boot time before actual graphics driver is loaded)

Thanks for the explanation. So it is only legacy, and enabling nested mode
would be forbidden for a device with Reserved System Memory regions? I'm
wondering if virtio-iommu RESV regions will be extended to affect a
specific PASIDs (or all requests-with-PASID) in the future.
>> It's not necessary with the base virtio-iommu device though (v0.4),
>> because the device can create the identity mappings itself and report them
>> to the guest as MEM_T_BYPASS. However, when we start handing page
> 
> when you say "the device can create ...", I think you really meant
> "host iommu driver can create identity mapping for assigned device",
> correct?
> 
> Then yes, I think above works.

Yes it can be the host IOMMU driver, or simply Qemu sending VFIO ioctls to
create those identity mappings (they are reported in sysfs reserved_regions).

>> table
>> control over to the guest, the host won't be in control of IOVA->GPA
>> mappings and will need to gracefully ask the guest to do it.
>>
>> I'm not aware of any firmware description resembling Intel RMRR or AMD
>> IVMD on ARM platforms. I do think ARM platforms could need
>> MEM_T_IDENTITY
>> for requesting the guest to map MSI windows when page-table handover is
>> in
>> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
>> mapping must be installed by the guest). But since a vSMMU would need a
>> solution as well, I think I'll try to implement something more generic.
> 
> curious do you need identity mapping full IOVA->GPA->HPA translation, 
> or just in GPA->HPA stage sufficient for above MSI scenario?

It has to be IOVA->GPA->HPA. So it'll be a bit complicated to implement
for us, I think we're going to need a VFIO ioctl to tell the host what
IOVA the guest allocated for its MSI, but it's not ideal.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] RE: [RFC] virtio-iommu version 0.4
  2017-09-21  6:27         ` [virtio-dev] " Tian, Kevin
  (?)
  (?)
@ 2017-09-25 13:32         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-25 13:32 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, eric.auger,
	Robin Murphy, eric.auger.pro

On 21/09/17 07:27, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Wednesday, September 6, 2017 7:55 PM
>>
>> Hi Kevin,
>>
>> On 28/08/17 08:39, Tian, Kevin wrote:
>>> Here comes some comments:
>>>
>>> 1.1 Motivation
>>>
>>> You describe I/O page faults handling as future work. Seems you
>> considered
>>> only recoverable fault (since "aka. PCI PRI" being used). What about other
>>> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
>>> a valid mapping? Even when there is no PRI support, we need some basic
>>> form of fault reporting mechanism to indicate such errors to guest.
>>
>> I am considering recoverable faults as the end goal, but reporting
>> unrecoverable faults should use the same queue, with slightly different
>> fields and no need for the driver to reply to the device.
> 
> what about adding a placeholder for now? Though same mechanism
> can be reused, it's an essential part to make virtio-iommu architecture
> complete even before talking support for recoverable faults. :-)

I'll see if I can come up with something simple for v0.5, but it seems
like a big chunk of work. I don't really know what to report to the guest
at the moment. I don't want to report vendor-specific details about the
fault, but it should still be useful content, to let the guest decide
whether they need to reset/kill the device or just print something

[...]
>> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
>> used for both iGPU and USB controllers on my x86 machines. Do you know
>> more precisely what they are used for by the firmware?
> 
> VTd spec has a clear description:
> 
> 3.14 Handling Requests to Reserved System Memory
> Reserved system memory regions are typically allocated by BIOS at boot 
> time and reported to OS as reserved address ranges in the system memory 
> map. Requests-without-PASID to these reserved regions may either occur 
> as a result of operations performed by the system software driver (for 
> example in the case of DMA from unified memory access (UMA) graphics 
> controllers to graphics reserved memory), or may be initiated by non 
> system software (for example in case of DMA performed by a USB 
> controller under BIOS SMM control for legacy keyboard emulation). 
> For proper functioning of these legacy reserved memory usages, when 
> system software enables DMA remapping, the second-level translation 
> structures for the respective devices are expected to be set up to provide
> identity mapping for the specified reserved memory regions with read 
> and write permissions.
> 
> (one specific example for GPU happens in legacy VGA usage in early
> boot time before actual graphics driver is loaded)

Thanks for the explanation. So it is only legacy, and enabling nested mode
would be forbidden for a device with Reserved System Memory regions? I'm
wondering if virtio-iommu RESV regions will be extended to affect a
specific PASIDs (or all requests-with-PASID) in the future.
>> It's not necessary with the base virtio-iommu device though (v0.4),
>> because the device can create the identity mappings itself and report them
>> to the guest as MEM_T_BYPASS. However, when we start handing page
> 
> when you say "the device can create ...", I think you really meant
> "host iommu driver can create identity mapping for assigned device",
> correct?
> 
> Then yes, I think above works.

Yes it can be the host IOMMU driver, or simply Qemu sending VFIO ioctls to
create those identity mappings (they are reported in sysfs reserved_regions).

>> table
>> control over to the guest, the host won't be in control of IOVA->GPA
>> mappings and will need to gracefully ask the guest to do it.
>>
>> I'm not aware of any firmware description resembling Intel RMRR or AMD
>> IVMD on ARM platforms. I do think ARM platforms could need
>> MEM_T_IDENTITY
>> for requesting the guest to map MSI windows when page-table handover is
>> in
>> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
>> mapping must be installed by the guest). But since a vSMMU would need a
>> solution as well, I think I'll try to implement something more generic.
> 
> curious do you need identity mapping full IOVA->GPA->HPA translation, 
> or just in GPA->HPA stage sufficient for above MSI scenario?

It has to be IOVA->GPA->HPA. So it'll be a bit complicated to implement
for us, I think we're going to need a VFIO ioctl to tell the host what
IOVA the guest allocated for its MSI, but it's not ideal.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] RE: [RFC] virtio-iommu version 0.4
@ 2017-09-25 13:32           ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-25 13:32 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

On 21/09/17 07:27, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Wednesday, September 6, 2017 7:55 PM
>>
>> Hi Kevin,
>>
>> On 28/08/17 08:39, Tian, Kevin wrote:
>>> Here comes some comments:
>>>
>>> 1.1 Motivation
>>>
>>> You describe I/O page faults handling as future work. Seems you
>> considered
>>> only recoverable fault (since "aka. PCI PRI" being used). What about other
>>> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
>>> a valid mapping? Even when there is no PRI support, we need some basic
>>> form of fault reporting mechanism to indicate such errors to guest.
>>
>> I am considering recoverable faults as the end goal, but reporting
>> unrecoverable faults should use the same queue, with slightly different
>> fields and no need for the driver to reply to the device.
> 
> what about adding a placeholder for now? Though same mechanism
> can be reused, it's an essential part to make virtio-iommu architecture
> complete even before talking support for recoverable faults. :-)

I'll see if I can come up with something simple for v0.5, but it seems
like a big chunk of work. I don't really know what to report to the guest
at the moment. I don't want to report vendor-specific details about the
fault, but it should still be useful content, to let the guest decide
whether they need to reset/kill the device or just print something

[...]
>> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
>> used for both iGPU and USB controllers on my x86 machines. Do you know
>> more precisely what they are used for by the firmware?
> 
> VTd spec has a clear description:
> 
> 3.14 Handling Requests to Reserved System Memory
> Reserved system memory regions are typically allocated by BIOS at boot 
> time and reported to OS as reserved address ranges in the system memory 
> map. Requests-without-PASID to these reserved regions may either occur 
> as a result of operations performed by the system software driver (for 
> example in the case of DMA from unified memory access (UMA) graphics 
> controllers to graphics reserved memory), or may be initiated by non 
> system software (for example in case of DMA performed by a USB 
> controller under BIOS SMM control for legacy keyboard emulation). 
> For proper functioning of these legacy reserved memory usages, when 
> system software enables DMA remapping, the second-level translation 
> structures for the respective devices are expected to be set up to provide
> identity mapping for the specified reserved memory regions with read 
> and write permissions.
> 
> (one specific example for GPU happens in legacy VGA usage in early
> boot time before actual graphics driver is loaded)

Thanks for the explanation. So it is only legacy, and enabling nested mode
would be forbidden for a device with Reserved System Memory regions? I'm
wondering if virtio-iommu RESV regions will be extended to affect a
specific PASIDs (or all requests-with-PASID) in the future.
>> It's not necessary with the base virtio-iommu device though (v0.4),
>> because the device can create the identity mappings itself and report them
>> to the guest as MEM_T_BYPASS. However, when we start handing page
> 
> when you say "the device can create ...", I think you really meant
> "host iommu driver can create identity mapping for assigned device",
> correct?
> 
> Then yes, I think above works.

Yes it can be the host IOMMU driver, or simply Qemu sending VFIO ioctls to
create those identity mappings (they are reported in sysfs reserved_regions).

>> table
>> control over to the guest, the host won't be in control of IOVA->GPA
>> mappings and will need to gracefully ask the guest to do it.
>>
>> I'm not aware of any firmware description resembling Intel RMRR or AMD
>> IVMD on ARM platforms. I do think ARM platforms could need
>> MEM_T_IDENTITY
>> for requesting the guest to map MSI windows when page-table handover is
>> in
>> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
>> mapping must be installed by the guest). But since a vSMMU would need a
>> solution as well, I think I'll try to implement something more generic.
> 
> curious do you need identity mapping full IOVA->GPA->HPA translation, 
> or just in GPA->HPA stage sufficient for above MSI scenario?

It has to be IOVA->GPA->HPA. So it'll be a bit complicated to implement
for us, I think we're going to need a VFIO ioctl to tell the host what
IOVA the guest allocated for its MSI, but it's not ideal.

Thanks,
Jean

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-09-21  6:41         ` [virtio-dev] " Tian, Kevin
@ 2017-09-25 13:47             ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-25 13:47 UTC (permalink / raw)
  To: Tian, Kevin, Auger Eric,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, Marc Zyngier,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, Will Deacon,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

On 21/09/17 07:41, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org]
>> Sent: Wednesday, September 6, 2017 7:49 PM
>>
>>
>>> 2.6.8.2.1
>>> Multiple overlapping RESV_MEM properties are merged together. Device
>>> requirement? if same types I assume?
>>
>> Combination rules apply to device and driver. When the driver puts
>> multiple endpoints in the same domain, combination rules apply. They will
>> become important once the guest attempts to do things like combining
>> endpoints with different PASID capabilities in the same domain. I'm
>> considering these endpoints to be behind different physical IOMMUs.
>>
>> For reserved regions, we specify what the driver should do and what the
>> device should enforce with regard to mapping IOVAs of overlapping regions.
>> For example, if a device has some virtual address RESV_T_MSI and an other
>> device has the same virtual address RESV_T_IDENTITY, what should the
>> driver do?
>>
>> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
>> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
>> have a meaning within a domain. From DMA mappings point of view, it is
>> effectively the same as RESV_T_RESERVED. When merging
>> RESV_T_RESERVED and
>> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
>> required
>> for one endpoint to work (the one with IDENTITY) and is not accessed by
>> the other (the driver must not allocate this IOVA range for DMA).
>>
>> More generally, I'd like to add the following combination table to the
>> spec, that shows the resulting reserved region type after merging regions
>> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
>> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
>> bottom half.
>>
>>       | N | R | M | I
>>     ------------------
>>     N | N | R | M | ?
>>     ------------------
>>     R |   | R | R | I
>>     ------------------
>>     M |   |   | M | I
>>     ------------------
>>     I |   |   |   | I
>>
>> The 'I' column is problematic. If one endpoint has an IDENTITY region and
>> the other doesn't have anything at that virtual address, then making that
>> region an identity mapping will result in the second endpoint being able
>> to access firmware memory. If using nested translation, stage-2 can help
>> us here. But in general we should allow the device to reject an attach
>> that would result in a N+I combination if the host considers it too dodgy.
>> I think the other combinations are safe enough.
>>
> 
> will overlap happen in real? Can we simplify the spec to have device
> not reporting overlapping regions?

I think it's likely to happen, to have two endpoints with different
overlapping types. Reporting is easy, but handling mismatched reserved
regions is hard. So what I can do is leave these combination rules as
implementation detail, and let the device reject any domain attach that
overlaps mismatched RESV regions.

After all, putting multiple devices in the same domain is a nice feature
but might not be widely used in reality. It's not worth spending too much
time describing all possible behaviors for it (but still worth thinking
about to avoid introducing security holes).

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-09-21  6:41         ` [virtio-dev] " Tian, Kevin
  (?)
@ 2017-09-25 13:47         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-25 13:47 UTC (permalink / raw)
  To: Tian, Kevin, Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

On 21/09/17 07:41, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Wednesday, September 6, 2017 7:49 PM
>>
>>
>>> 2.6.8.2.1
>>> Multiple overlapping RESV_MEM properties are merged together. Device
>>> requirement? if same types I assume?
>>
>> Combination rules apply to device and driver. When the driver puts
>> multiple endpoints in the same domain, combination rules apply. They will
>> become important once the guest attempts to do things like combining
>> endpoints with different PASID capabilities in the same domain. I'm
>> considering these endpoints to be behind different physical IOMMUs.
>>
>> For reserved regions, we specify what the driver should do and what the
>> device should enforce with regard to mapping IOVAs of overlapping regions.
>> For example, if a device has some virtual address RESV_T_MSI and an other
>> device has the same virtual address RESV_T_IDENTITY, what should the
>> driver do?
>>
>> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
>> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
>> have a meaning within a domain. From DMA mappings point of view, it is
>> effectively the same as RESV_T_RESERVED. When merging
>> RESV_T_RESERVED and
>> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
>> required
>> for one endpoint to work (the one with IDENTITY) and is not accessed by
>> the other (the driver must not allocate this IOVA range for DMA).
>>
>> More generally, I'd like to add the following combination table to the
>> spec, that shows the resulting reserved region type after merging regions
>> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
>> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
>> bottom half.
>>
>>       | N | R | M | I
>>     ------------------
>>     N | N | R | M | ?
>>     ------------------
>>     R |   | R | R | I
>>     ------------------
>>     M |   |   | M | I
>>     ------------------
>>     I |   |   |   | I
>>
>> The 'I' column is problematic. If one endpoint has an IDENTITY region and
>> the other doesn't have anything at that virtual address, then making that
>> region an identity mapping will result in the second endpoint being able
>> to access firmware memory. If using nested translation, stage-2 can help
>> us here. But in general we should allow the device to reject an attach
>> that would result in a N+I combination if the host considers it too dodgy.
>> I think the other combinations are safe enough.
>>
> 
> will overlap happen in real? Can we simplify the spec to have device
> not reporting overlapping regions?

I think it's likely to happen, to have two endpoints with different
overlapping types. Reporting is easy, but handling mismatched reserved
regions is hard. So what I can do is leave these combination rules as
implementation detail, and let the device reject any domain attach that
overlaps mismatched RESV regions.

After all, putting multiple devices in the same domain is a nice feature
but might not be widely used in reality. It's not worth spending too much
time describing all possible behaviors for it (but still worth thinking
about to avoid introducing security holes).

Thanks,
Jean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-09-25 13:47             ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-25 13:47 UTC (permalink / raw)
  To: Tian, Kevin, Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx

On 21/09/17 07:41, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Wednesday, September 6, 2017 7:49 PM
>>
>>
>>> 2.6.8.2.1
>>> Multiple overlapping RESV_MEM properties are merged together. Device
>>> requirement? if same types I assume?
>>
>> Combination rules apply to device and driver. When the driver puts
>> multiple endpoints in the same domain, combination rules apply. They will
>> become important once the guest attempts to do things like combining
>> endpoints with different PASID capabilities in the same domain. I'm
>> considering these endpoints to be behind different physical IOMMUs.
>>
>> For reserved regions, we specify what the driver should do and what the
>> device should enforce with regard to mapping IOVAs of overlapping regions.
>> For example, if a device has some virtual address RESV_T_MSI and an other
>> device has the same virtual address RESV_T_IDENTITY, what should the
>> driver do?
>>
>> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
>> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
>> have a meaning within a domain. From DMA mappings point of view, it is
>> effectively the same as RESV_T_RESERVED. When merging
>> RESV_T_RESERVED and
>> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
>> required
>> for one endpoint to work (the one with IDENTITY) and is not accessed by
>> the other (the driver must not allocate this IOVA range for DMA).
>>
>> More generally, I'd like to add the following combination table to the
>> spec, that shows the resulting reserved region type after merging regions
>> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
>> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
>> bottom half.
>>
>>       | N | R | M | I
>>     ------------------
>>     N | N | R | M | ?
>>     ------------------
>>     R |   | R | R | I
>>     ------------------
>>     M |   |   | M | I
>>     ------------------
>>     I |   |   |   | I
>>
>> The 'I' column is problematic. If one endpoint has an IDENTITY region and
>> the other doesn't have anything at that virtual address, then making that
>> region an identity mapping will result in the second endpoint being able
>> to access firmware memory. If using nested translation, stage-2 can help
>> us here. But in general we should allow the device to reject an attach
>> that would result in a N+I combination if the host considers it too dodgy.
>> I think the other combinations are safe enough.
>>
> 
> will overlap happen in real? Can we simplify the spec to have device
> not reporting overlapping regions?

I think it's likely to happen, to have two endpoints with different
overlapping types. Reporting is easy, but handling mismatched reserved
regions is hard. So what I can do is leave these combination rules as
implementation detail, and let the device reject any domain attach that
overlaps mismatched RESV regions.

After all, putting multiple devices in the same domain is a nice feature
but might not be widely used in reality. It's not worth spending too much
time describing all possible behaviors for it (but still worth thinking
about to avoid introducing security holes).

Thanks,
Jean

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-10-03 13:04   ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-10-03 13:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
When rebasing the v0.4 driver on master I observe a regression: commands
are not received properly by QEMU (typically an attach command is
received with a type of 0). After a bisection of the guest kernel the
first commit the problem appears is:

commit e3067861ba6650a566a6273738c23c956ad55c02
arm64: add basic VMAP_STACK support

After reverting this patch, things resume working.

I observe the problem with a 4kB page guest kernel.

Thanks

Eric


> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
                   ` (14 preceding siblings ...)
  (?)
@ 2017-10-03 13:04 ` Auger Eric
  -1 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-10-03 13:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
When rebasing the v0.4 driver on master I observe a regression: commands
are not received properly by QEMU (typically an attach command is
received with a type of 0). After a bisection of the guest kernel the
first commit the problem appears is:

commit e3067861ba6650a566a6273738c23c956ad55c02
arm64: add basic VMAP_STACK support

After reverting this patch, things resume working.

I observe the problem with a 4kB page guest kernel.

Thanks

Eric


> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-10-03 13:04   ` Auger Eric
  0 siblings, 0 replies; 71+ messages in thread
From: Auger Eric @ 2017-10-03 13:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
When rebasing the v0.4 driver on master I observe a regression: commands
are not received properly by QEMU (typically an attach command is
received with a type of 0). After a bisection of the guest kernel the
first commit the problem appears is:

commit e3067861ba6650a566a6273738c23c956ad55c02
arm64: add basic VMAP_STACK support

After reverting this patch, things resume working.

I observe the problem with a 4kB page guest kernel.

Thanks

Eric


> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-10-03 13:04   ` [virtio-dev] " Auger Eric
@ 2017-10-09 19:46       ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-10-09 19:46 UTC (permalink / raw)
  To: Auger Eric, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, Marc Zyngier,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, Will Deacon,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

Hi Eric,

On 03/10/17 14:04, Auger Eric wrote:
> When rebasing the v0.4 driver on master I observe a regression: commands
> are not received properly by QEMU (typically an attach command is
> received with a type of 0). After a bisection of the guest kernel the
> first commit the problem appears is:
> 
> commit e3067861ba6650a566a6273738c23c956ad55c02
> arm64: add basic VMAP_STACK support

Indeed, thanks for the report. I can reproduce the problem with 4.14 and
kvmtool. We should allocate buffers on the heap for any request (see for
example 9472fe704 or e37e2ff35). I rebased onto 4.14 and pushed the
following patch at: git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4.1

It's not very nice, but should fix the issue. I didn't manage to measure
the difference though, my benchmark isn't precise enough. So I don't
know if we should try to use a kmem cache instead of kmalloc.

Thanks,
Jean

--- 8< ---
>From 3fc957560e1e6f070a0468bf75ebc4862d37ff82 Mon Sep 17 00:00:00 2001
From: Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
Date: Mon, 9 Oct 2017 20:13:57 +0100
Subject: [PATCH] iommu/virtio-iommu: Allocate all requests on the heap

When CONFIG_VMAP_STACK is enabled, DMA on the stack isn't allowed. Instead
of allocating virtio-iommu requests on the stack, use kmalloc.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/virtio-iommu.c | 86 +++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index ec1cfaba5997..2d8fd5e99fa7 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -473,13 +473,10 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int i;
 	int ret = 0;
+	struct virtio_iommu_req_attach *req;
 	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
 	struct viommu_endpoint *vdev = fwspec->iommu_priv;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_attach req = {
-		.head.type	= VIRTIO_IOMMU_T_ATTACH,
-		.address_space	= cpu_to_le32(vdomain->id),
-	};

 	mutex_lock(&vdomain->mutex);
 	if (!vdomain->viommu) {
@@ -531,14 +528,25 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)

 	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);

+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_attach) {
+		.head.type	= VIRTIO_IOMMU_T_ATTACH,
+		.address_space	= cpu_to_le32(vdomain->id),
+	};
+
 	for (i = 0; i < fwspec->num_ids; i++) {
-		req.device = cpu_to_le32(fwspec->ids[i]);
+		req->device = cpu_to_le32(fwspec->ids[i]);

-		ret = viommu_send_req_sync(vdomain->viommu, &req);
+		ret = viommu_send_req_sync(vdomain->viommu, req);
 		if (ret)
 			break;
 	}

+	kfree(req);
+
 	vdomain->attached++;
 	vdev->vdomain = vdomain;

@@ -550,13 +558,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 {
 	int ret;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_map req = {
-		.head.type	= VIRTIO_IOMMU_T_MAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-		.phys_addr	= cpu_to_le64(paddr),
-		.size		= cpu_to_le64(size),
-	};
+	struct virtio_iommu_req_map *req;

 	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
 		 paddr, size);
@@ -564,17 +566,30 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 	if (!vdomain->attached)
 		return -ENODEV;

-	if (prot & IOMMU_READ)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
-
-	if (prot & IOMMU_WRITE)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
-
 	ret = viommu_tlb_map(vdomain, iova, paddr, size);
 	if (ret)
 		return ret;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_map) {
+		.head.type	= VIRTIO_IOMMU_T_MAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.phys_addr	= cpu_to_le64(paddr),
+		.size		= cpu_to_le64(size),
+	};
+
+	if (prot & IOMMU_READ)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+	if (prot & IOMMU_WRITE)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		viommu_tlb_unmap(vdomain, iova, size);

@@ -587,11 +602,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	int ret;
 	size_t unmapped;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_unmap req = {
-		.head.type	= VIRTIO_IOMMU_T_UNMAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-	};
+	struct virtio_iommu_req_unmap *req;

 	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);

@@ -603,9 +614,19 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	if (unmapped < size)
 		return 0;

-	req.size = cpu_to_le64(unmapped);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	*req = (struct virtio_iommu_req_unmap) {
+		.head.type	= VIRTIO_IOMMU_T_UNMAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.size		= cpu_to_le64(unmapped),
+	};
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		return 0;

@@ -626,12 +647,16 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	unsigned long mapped_iova;
 	size_t top_size, bottom_size;
 	struct viommu_request reqs[nents];
-	struct virtio_iommu_req_map map_reqs[nents];
+	struct virtio_iommu_req_map *map_reqs;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);

 	if (!vdomain->attached)
 		return 0;

+	map_reqs = kcalloc(nents, sizeof(*map_reqs), GFP_KERNEL);
+	if (!map_reqs)
+		return -ENOMEM;
+
 	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);

 	if (prot & IOMMU_READ)
@@ -679,6 +704,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,

 	if (ret) {
 		viommu_tlb_unmap(vdomain, iova, total_size);
+		kfree(map_reqs);
 		return 0;
 	}

@@ -692,6 +718,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			goto err_rollback;
 	}

+	kfree(map_reqs);
 	return total_size;

 err_rollback:
@@ -719,6 +746,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	}

 	viommu_tlb_unmap(vdomain, iova, total_size);
+	kfree(map_reqs);

 	return 0;
 }
@@ -863,6 +891,8 @@ static int viommu_probe_device(struct viommu_dev *viommu,
 		type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
 	}

+	kfree(probe);
+
 	return 0;
 }

--
2.13.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
  2017-10-03 13:04   ` [virtio-dev] " Auger Eric
  (?)
  (?)
@ 2017-10-09 19:46   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-10-09 19:46 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

Hi Eric,

On 03/10/17 14:04, Auger Eric wrote:
> When rebasing the v0.4 driver on master I observe a regression: commands
> are not received properly by QEMU (typically an attach command is
> received with a type of 0). After a bisection of the guest kernel the
> first commit the problem appears is:
> 
> commit e3067861ba6650a566a6273738c23c956ad55c02
> arm64: add basic VMAP_STACK support

Indeed, thanks for the report. I can reproduce the problem with 4.14 and
kvmtool. We should allocate buffers on the heap for any request (see for
example 9472fe704 or e37e2ff35). I rebased onto 4.14 and pushed the
following patch at: git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4.1

It's not very nice, but should fix the issue. I didn't manage to measure
the difference though, my benchmark isn't precise enough. So I don't
know if we should try to use a kmem cache instead of kmalloc.

Thanks,
Jean

--- 8< ---
From 3fc957560e1e6f070a0468bf75ebc4862d37ff82 Mon Sep 17 00:00:00 2001
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Date: Mon, 9 Oct 2017 20:13:57 +0100
Subject: [PATCH] iommu/virtio-iommu: Allocate all requests on the heap

When CONFIG_VMAP_STACK is enabled, DMA on the stack isn't allowed. Instead
of allocating virtio-iommu requests on the stack, use kmalloc.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/virtio-iommu.c | 86 +++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index ec1cfaba5997..2d8fd5e99fa7 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -473,13 +473,10 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int i;
 	int ret = 0;
+	struct virtio_iommu_req_attach *req;
 	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
 	struct viommu_endpoint *vdev = fwspec->iommu_priv;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_attach req = {
-		.head.type	= VIRTIO_IOMMU_T_ATTACH,
-		.address_space	= cpu_to_le32(vdomain->id),
-	};

 	mutex_lock(&vdomain->mutex);
 	if (!vdomain->viommu) {
@@ -531,14 +528,25 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)

 	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);

+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_attach) {
+		.head.type	= VIRTIO_IOMMU_T_ATTACH,
+		.address_space	= cpu_to_le32(vdomain->id),
+	};
+
 	for (i = 0; i < fwspec->num_ids; i++) {
-		req.device = cpu_to_le32(fwspec->ids[i]);
+		req->device = cpu_to_le32(fwspec->ids[i]);

-		ret = viommu_send_req_sync(vdomain->viommu, &req);
+		ret = viommu_send_req_sync(vdomain->viommu, req);
 		if (ret)
 			break;
 	}

+	kfree(req);
+
 	vdomain->attached++;
 	vdev->vdomain = vdomain;

@@ -550,13 +558,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 {
 	int ret;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_map req = {
-		.head.type	= VIRTIO_IOMMU_T_MAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-		.phys_addr	= cpu_to_le64(paddr),
-		.size		= cpu_to_le64(size),
-	};
+	struct virtio_iommu_req_map *req;

 	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
 		 paddr, size);
@@ -564,17 +566,30 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 	if (!vdomain->attached)
 		return -ENODEV;

-	if (prot & IOMMU_READ)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
-
-	if (prot & IOMMU_WRITE)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
-
 	ret = viommu_tlb_map(vdomain, iova, paddr, size);
 	if (ret)
 		return ret;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_map) {
+		.head.type	= VIRTIO_IOMMU_T_MAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.phys_addr	= cpu_to_le64(paddr),
+		.size		= cpu_to_le64(size),
+	};
+
+	if (prot & IOMMU_READ)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+	if (prot & IOMMU_WRITE)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		viommu_tlb_unmap(vdomain, iova, size);

@@ -587,11 +602,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	int ret;
 	size_t unmapped;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_unmap req = {
-		.head.type	= VIRTIO_IOMMU_T_UNMAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-	};
+	struct virtio_iommu_req_unmap *req;

 	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);

@@ -603,9 +614,19 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	if (unmapped < size)
 		return 0;

-	req.size = cpu_to_le64(unmapped);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	*req = (struct virtio_iommu_req_unmap) {
+		.head.type	= VIRTIO_IOMMU_T_UNMAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.size		= cpu_to_le64(unmapped),
+	};
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		return 0;

@@ -626,12 +647,16 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	unsigned long mapped_iova;
 	size_t top_size, bottom_size;
 	struct viommu_request reqs[nents];
-	struct virtio_iommu_req_map map_reqs[nents];
+	struct virtio_iommu_req_map *map_reqs;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);

 	if (!vdomain->attached)
 		return 0;

+	map_reqs = kcalloc(nents, sizeof(*map_reqs), GFP_KERNEL);
+	if (!map_reqs)
+		return -ENOMEM;
+
 	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);

 	if (prot & IOMMU_READ)
@@ -679,6 +704,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,

 	if (ret) {
 		viommu_tlb_unmap(vdomain, iova, total_size);
+		kfree(map_reqs);
 		return 0;
 	}

@@ -692,6 +718,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			goto err_rollback;
 	}

+	kfree(map_reqs);
 	return total_size;

 err_rollback:
@@ -719,6 +746,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	}

 	viommu_tlb_unmap(vdomain, iova, total_size);
+	kfree(map_reqs);

 	return 0;
 }
@@ -863,6 +891,8 @@ static int viommu_probe_device(struct viommu_dev *viommu,
 		type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
 	}

+	kfree(probe);
+
 	return 0;
 }

--
2.13.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [virtio-dev] [RFC] virtio-iommu version 0.4
@ 2017-10-09 19:46       ` Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-10-09 19:46 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Eric,

On 03/10/17 14:04, Auger Eric wrote:
> When rebasing the v0.4 driver on master I observe a regression: commands
> are not received properly by QEMU (typically an attach command is
> received with a type of 0). After a bisection of the guest kernel the
> first commit the problem appears is:
> 
> commit e3067861ba6650a566a6273738c23c956ad55c02
> arm64: add basic VMAP_STACK support

Indeed, thanks for the report. I can reproduce the problem with 4.14 and
kvmtool. We should allocate buffers on the heap for any request (see for
example 9472fe704 or e37e2ff35). I rebased onto 4.14 and pushed the
following patch at: git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4.1

It's not very nice, but should fix the issue. I didn't manage to measure
the difference though, my benchmark isn't precise enough. So I don't
know if we should try to use a kmem cache instead of kmalloc.

Thanks,
Jean

--- 8< ---
From 3fc957560e1e6f070a0468bf75ebc4862d37ff82 Mon Sep 17 00:00:00 2001
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Date: Mon, 9 Oct 2017 20:13:57 +0100
Subject: [PATCH] iommu/virtio-iommu: Allocate all requests on the heap

When CONFIG_VMAP_STACK is enabled, DMA on the stack isn't allowed. Instead
of allocating virtio-iommu requests on the stack, use kmalloc.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/virtio-iommu.c | 86 +++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index ec1cfaba5997..2d8fd5e99fa7 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -473,13 +473,10 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int i;
 	int ret = 0;
+	struct virtio_iommu_req_attach *req;
 	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
 	struct viommu_endpoint *vdev = fwspec->iommu_priv;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_attach req = {
-		.head.type	= VIRTIO_IOMMU_T_ATTACH,
-		.address_space	= cpu_to_le32(vdomain->id),
-	};

 	mutex_lock(&vdomain->mutex);
 	if (!vdomain->viommu) {
@@ -531,14 +528,25 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)

 	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);

+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_attach) {
+		.head.type	= VIRTIO_IOMMU_T_ATTACH,
+		.address_space	= cpu_to_le32(vdomain->id),
+	};
+
 	for (i = 0; i < fwspec->num_ids; i++) {
-		req.device = cpu_to_le32(fwspec->ids[i]);
+		req->device = cpu_to_le32(fwspec->ids[i]);

-		ret = viommu_send_req_sync(vdomain->viommu, &req);
+		ret = viommu_send_req_sync(vdomain->viommu, req);
 		if (ret)
 			break;
 	}

+	kfree(req);
+
 	vdomain->attached++;
 	vdev->vdomain = vdomain;

@@ -550,13 +558,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 {
 	int ret;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_map req = {
-		.head.type	= VIRTIO_IOMMU_T_MAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-		.phys_addr	= cpu_to_le64(paddr),
-		.size		= cpu_to_le64(size),
-	};
+	struct virtio_iommu_req_map *req;

 	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
 		 paddr, size);
@@ -564,17 +566,30 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 	if (!vdomain->attached)
 		return -ENODEV;

-	if (prot & IOMMU_READ)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
-
-	if (prot & IOMMU_WRITE)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
-
 	ret = viommu_tlb_map(vdomain, iova, paddr, size);
 	if (ret)
 		return ret;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_map) {
+		.head.type	= VIRTIO_IOMMU_T_MAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.phys_addr	= cpu_to_le64(paddr),
+		.size		= cpu_to_le64(size),
+	};
+
+	if (prot & IOMMU_READ)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+	if (prot & IOMMU_WRITE)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		viommu_tlb_unmap(vdomain, iova, size);

@@ -587,11 +602,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	int ret;
 	size_t unmapped;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_unmap req = {
-		.head.type	= VIRTIO_IOMMU_T_UNMAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-	};
+	struct virtio_iommu_req_unmap *req;

 	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);

@@ -603,9 +614,19 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	if (unmapped < size)
 		return 0;

-	req.size = cpu_to_le64(unmapped);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	*req = (struct virtio_iommu_req_unmap) {
+		.head.type	= VIRTIO_IOMMU_T_UNMAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.size		= cpu_to_le64(unmapped),
+	};
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		return 0;

@@ -626,12 +647,16 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	unsigned long mapped_iova;
 	size_t top_size, bottom_size;
 	struct viommu_request reqs[nents];
-	struct virtio_iommu_req_map map_reqs[nents];
+	struct virtio_iommu_req_map *map_reqs;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);

 	if (!vdomain->attached)
 		return 0;

+	map_reqs = kcalloc(nents, sizeof(*map_reqs), GFP_KERNEL);
+	if (!map_reqs)
+		return -ENOMEM;
+
 	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);

 	if (prot & IOMMU_READ)
@@ -679,6 +704,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,

 	if (ret) {
 		viommu_tlb_unmap(vdomain, iova, total_size);
+		kfree(map_reqs);
 		return 0;
 	}

@@ -692,6 +718,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			goto err_rollback;
 	}

+	kfree(map_reqs);
 	return total_size;

 err_rollback:
@@ -719,6 +746,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	}

 	viommu_tlb_unmap(vdomain, iova, total_size);
+	kfree(map_reqs);

 	return 0;
 }
@@ -863,6 +891,8 @@ static int viommu_probe_device(struct viommu_dev *viommu,
 		type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
 	}

+	kfree(probe);
+
 	return 0;
 }

--
2.13.3



---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC] virtio-iommu version 0.4
@ 2017-08-04 18:19 Jean-Philippe Brucker
  0 siblings, 0 replies; 71+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

This is the continuation of my proposal for virtio-iommu, the para-
virtualized IOMMU. Here is a summary of the changes since last time [1]:

* The virtio-iommu document now resembles an actual specification. It is
  split into a formal description of the virtio device, and implementation
  notes. Please find sources and binaries at [2].

* Added a probe request to describe to the guest different properties that
  do not fit in firmware or in the virtio config space. This is a
  necessary stepping stone for extending the virtio-iommu.

* There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
  Bhushan.

You can find the Linux driver and kvmtool device at [4] and [5]. I
plan to rework driver and kvmtool device slightly before sending the
patches.

To understand the virtio-iommu, I advise to first read introduction and
motivation, then skim through implementation notes and finally look at the
device specification.

I wasn't sure how to organize the review. For those who prefer to comment
inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
to this thread. They are the biggest chunks of the document. But LaTeX
isn't very pleasant to read, so you can simply send a list of comments in
relation to section numbers and a few words of context, we'll manage.

---
Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
more easily differences since the RFC (see [6]), but haven't been made
public so far. This is the first public posting since initial proposal
[1], and the following describes all changes.

## v0.1 ##

Content is the same as the RFC, but formatted to LaTeX. 'make' generates
one PDF and one HTML document.

## v0.2 ##

Add introductions, improve topology example and firmware description based
on feedback and a number of useful discussions.

## v0.3 ##

Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
the device and driver behaviour. Unmap semantics are consolidated; they
are now closer to VFIO Type1 v2 semantics.

## v0.4 ##

Introduce PROBE requests. They provide per-endpoint information to the
driver that couldn't be described otherwise.

For the moment, they allow to handle MSIs on x86 virtual platforms (see
3.2). To do that we communicate reserved IOVA regions, that will also be
useful for describing regions that cannot be mapped for a given endpoint,
for instance addresses that correspond to a PCI bridge window.

Introducing such a large framework for this tiny feature may seem
overkill, but it is needed for future extensions of the virtio-iommu and I
believe it really is worth the effort.

## Future ##

Other extensions are in preparation. I won't detail them here because v0.4
already is a lot to digest, but in short, building on top of PROBE:

* First, since the IOMMU is paravirtualized, the device can expose some
  properties of the physical topology to the guest, and let it allocate
  resources more efficiently. For example, when the virtio-iommu manages
  both physical and emulated endpoints, with different underlying IOMMUs,
  we now have a way to describe multiple page and block granularities,
  instead of forcing the guest to use the most restricted one for all
  endpoints. This will most likely be in v0.5.

* Then on top of that, a major improvement will describe hardware
  acceleration features available to the guest. There is what I call "Page
  Table Handover" (or simply, from the host POV, "Nested"), the ability
  for the guest to manipulate its own page tables instead of sending
  MAP/UNMAP requests to the host. This, along with IO Page Fault
  reporting, will also permit SVM virtualization on different platforms.

Thanks,
Jean

[1] http://www.spinics.net/lists/kvm/msg147990.html
[2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
    http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
    I reiterate the disclaimers: don't use this document as a reference,
    it's a draft. It's also not an OASIS document yet. It may be riddled
    with mistakes. As this is a working draft, it is unstable and I do not
    guarantee backward compatibility of future versions.
[3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
[4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
    Warning: UAPI headers have changed! They didn't follow the spec,
    please update. (Use branch v0.1, that has the old headers, for the
    Qemu prototype [3])
[5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
    Warning: command-line has changed! Use --viommu vfio[,opts] and
    --viommu virtio[,opts] to instantiate a device.
[6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2017-10-09 19:46 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-04 18:19 [RFC] virtio-iommu version 0.4 Jean-Philippe Brucker
2017-08-04 18:19 ` [virtio-dev] " Jean-Philippe Brucker
2017-08-04 18:19 ` [RFC] virtio-iommu v0.4 - IOMMU Device Jean-Philippe Brucker
2017-08-04 18:19 ` Jean-Philippe Brucker
2017-08-04 18:19   ` [virtio-dev] " Jean-Philippe Brucker
2017-08-04 18:19 ` [RFC] virtio-iommu v0.4 - Implementation notes Jean-Philippe Brucker
2017-08-04 18:19 ` Jean-Philippe Brucker
2017-08-04 18:19   ` [virtio-dev] " Jean-Philippe Brucker
2017-08-14  8:27 ` [RFC] virtio-iommu version 0.4 Tian, Kevin
     [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-08-14  9:38     ` Jean-Philippe Brucker
2017-08-14  9:38       ` [virtio-dev] " Jean-Philippe Brucker
2017-08-14  9:38   ` Jean-Philippe Brucker
2017-08-14  8:27 ` Tian, Kevin
2017-08-16  4:08 ` [virtio-dev] " Adam Tao
2017-08-16  4:08   ` Adam Tao
2017-08-16  4:08   ` Adam Tao
2017-08-17 10:12   ` Jean-Philippe Brucker
2017-08-17 10:12     ` Jean-Philippe Brucker
     [not found]     ` <a2e573f3-bdfb-b5e5-7835-a7597abd48f5-5wv7dgnIgG8@public.gmane.org>
2017-08-17 11:27       ` Bharat Bhushan
2017-08-17 11:27         ` Bharat Bhushan
2017-08-17 11:27     ` Bharat Bhushan
2017-08-17 10:12   ` Jean-Philippe Brucker
2017-08-16  4:08 ` Adam Tao
2017-08-23 10:01 ` Jean-Philippe Brucker
2017-08-23 10:01 ` Jean-Philippe Brucker
2017-08-23 10:01   ` [virtio-dev] " Jean-Philippe Brucker
2017-08-28  7:39   ` Tian, Kevin
2017-08-28  7:39     ` [virtio-dev] " Tian, Kevin
2017-09-06 11:54     ` Jean-Philippe Brucker
2017-09-06 11:54       ` [virtio-dev] " Jean-Philippe Brucker
2017-09-21  6:27       ` Tian, Kevin
2017-09-21  6:27       ` Tian, Kevin
2017-09-21  6:27         ` [virtio-dev] " Tian, Kevin
2017-09-25 13:32         ` Jean-Philippe Brucker
2017-09-25 13:32           ` Jean-Philippe Brucker
2017-09-25 13:32         ` Jean-Philippe Brucker
2017-09-06 11:54     ` Jean-Philippe Brucker
2017-08-23 13:55 ` Auger Eric
2017-08-23 13:55   ` [virtio-dev] " Auger Eric
     [not found]   ` <f9b02ce1-057f-df4f-90d7-52616ad60b88-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-09-06 11:48     ` Jean-Philippe Brucker
2017-09-06 11:48       ` Jean-Philippe Brucker
2017-09-21  6:41       ` Tian, Kevin
2017-09-21  6:41       ` Tian, Kevin
2017-09-21  6:41         ` [virtio-dev] " Tian, Kevin
2017-09-25 13:47         ` Jean-Philippe Brucker
     [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D190DDD022-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-09-25 13:47           ` Jean-Philippe Brucker
2017-09-25 13:47             ` Jean-Philippe Brucker
2017-09-06 11:48   ` Jean-Philippe Brucker
2017-08-23 13:55 ` Auger Eric
2017-09-12 17:13 ` Auger Eric
2017-09-12 17:13   ` [virtio-dev] " Auger Eric
2017-09-13  3:47   ` Bharat Bhushan
     [not found]   ` <072f5a14-baae-57a9-9c5b-b93163c67075-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-09-13  3:47     ` Bharat Bhushan
2017-09-13  3:47       ` [virtio-dev] " Bharat Bhushan
2017-09-19 10:47   ` Jean-Philippe Brucker
2017-09-19 10:47     ` [virtio-dev] " Jean-Philippe Brucker
2017-09-20  9:37     ` Auger Eric
2017-09-20  9:37     ` Auger Eric
2017-09-20  9:37       ` [virtio-dev] " Auger Eric
2017-09-21 10:29       ` Jean-Philippe Brucker
2017-09-21 10:29       ` Jean-Philippe Brucker
2017-09-21 10:29         ` [virtio-dev] " Jean-Philippe Brucker
2017-09-19 10:47   ` Jean-Philippe Brucker
2017-09-12 17:13 ` Auger Eric
2017-10-03 13:04 ` [virtio-dev] " Auger Eric
2017-10-03 13:04 ` Auger Eric
2017-10-03 13:04   ` [virtio-dev] " Auger Eric
     [not found]   ` <0e0db2c0-86bc-a76a-05bd-0b41e00ba926-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-09 19:46     ` Jean-Philippe Brucker
2017-10-09 19:46       ` Jean-Philippe Brucker
2017-10-09 19:46   ` Jean-Philippe Brucker
  -- strict thread matches above, loose matches on Subject: below --
2017-08-04 18:19 Jean-Philippe Brucker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.