qemu-devel.nongnu.org archive mirror
* [PATCH RFC 00/19] vfio-user implementation
@ 2021-07-19  6:27 Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 01/19] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
                   ` (19 more replies)
  0 siblings, 20 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

Hi

We are happy to introduce the next stage of the multi-process QEMU project[1].

vfio-user is a protocol that allows a device to be emulated in a separate
process outside of QEMU. It encapsulates the messages sent from QEMU to the
kernel VFIO driver, and sends them to a remote process over a UNIX socket.

The vfio-user framework consists of 3 parts:
 1) The protocol specification.
 2) A client - the VFIO generic device in QEMU that exchanges the protocol messages with the server.
 3) A server - a remote process that emulates a device.

This patchset implements parts 1 and 2.
The protocol's specification can be found here [2].
We also include this as the first patch of the series.

The libvfio-user project (https://github.com/nutanix/libvfio-user)
can be used by a remote process to handle the protocol to implement the
third part.
We also worked on implementing a server and will be sending this patch
series shortly.

Contributors:

John G Johnson <john.g.johnson@oracle.com>
John Levon <john.levon@nutanix.com>
Thanos Makatos <thanos.makatos@nutanix.com>
Elena Ufimtseva <elena.ufimtseva@oracle.com>
Jagannathan Raman <jag.raman@oracle.com>

Please send your comments and questions!

Thank you.

References:
[1] https://wiki.qemu.org/Features/MultiProcessQEMU
[2] https://patchwork.kernel.org/project/qemu-devel/patch/20210614104608.212276-1-thanos.makatos@nutanix.com/

John G Johnson (18):
  vfio-user: add VFIO base abstract class
  vfio-user: define VFIO Proxy and communication functions
  vfio-user: Define type vfio_user_pci_dev_info
  vfio-user: connect vfio proxy to remote server
  vfio-user: negotiate protocol with remote server
  vfio-user: define vfio-user pci ops
  vfio-user: VFIO container setup & teardown
  vfio-user: get device info and get irq info
  vfio-user: device region read/write
  vfio-user: get region and DMA map/unmap operations
  vfio-user: probe remote device's BARs
  vfio-user: respond to remote DMA read/write requests
  vfio_user: setup MSI/X interrupts and PCI config operations
  vfio-user: vfio user device realize
  vfio-user: pci reset
  vfio-user: probe remote device ROM BAR
  vfio-user: migration support
  vfio-user: add migration cli options and version negotiation

Thanos Makatos (1):
  vfio-user: introduce vfio-user protocol specification

 docs/devel/index.rst          |    1 +
 docs/devel/vfio-user.rst      | 1809 +++++++++++++++++++++++++++++++++
 hw/vfio/pci.h                 |   25 +-
 hw/vfio/user.h                |  279 +++++
 include/hw/vfio/vfio-common.h |    8 +
 hw/vfio/common.c              |  273 ++++-
 hw/vfio/migration.c           |   35 +-
 hw/vfio/pci.c                 |  547 ++++++++--
 hw/vfio/user.c                |  997 ++++++++++++++++++
 MAINTAINERS                   |   10 +
 hw/vfio/meson.build           |    1 +
 11 files changed, 3879 insertions(+), 106 deletions(-)
 create mode 100644 docs/devel/vfio-user.rst
 create mode 100644 hw/vfio/user.h
 create mode 100644 hw/vfio/user.c

-- 
2.25.1




* [PATCH RFC 01/19] vfio-user: introduce vfio-user protocol specification
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 02/19] vfio-user: add VFIO base abstract class Elena Ufimtseva
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: Thanos Makatos <thanos.makatos@nutanix.com>

This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

This patch is sourced from:
https://patchwork.kernel.org/project/qemu-devel/patch/20210614104608.212276-1-thanos.makatos@nutanix.com/

It was discussed earlier as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
Signed-off-by: John Levon <john.levon@nutanix.com>
---
 docs/devel/index.rst     |    1 +
 docs/devel/vfio-user.rst | 1809 ++++++++++++++++++++++++++++++++++++++
 MAINTAINERS              |    6 +
 3 files changed, 1816 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index 153979caf4..215eb4ff7a 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -42,3 +42,4 @@ modifying QEMU's source code.
    multi-process
    ebpf_rss
    vfio-migration
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 0000000000..0b2acec101
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1809 @@
+.. include:: <isonum.txt>
+********************************
+vfio-user Protocol Specification
+********************************
+
+--------------
+Version_ 0.9.1
+--------------
+
+.. contents:: Table of Contents
+
+Introduction
+============
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The vfio-user specification is partly based on the
+`Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_.
+
+VFIO is a mature and stable API, backed by an extensively used framework. The
+existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
+re-used, though there is nothing in this specification that requires that
+particular implementation. None of the VFIO kernel modules are required for
+supporting the protocol, on either the client or server side. Some source
+definitions in VFIO are re-used for vfio-user.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
+chosen because file descriptors can be trivially sent over it, which in turn
+allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfds for triggering interrupts.
+
+Other socket types could be used which allow the server to run in a separate
+guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically,
+the underlying transport does not have to be a socket; however, we do not
+examine such alternatives. In this protocol version we focus on using a UNIX
+domain socket and introduce basic support for the other two types of sockets
+without considering performance implications.
+
+While passing of file descriptors is desirable for performance reasons, support
+is not necessary for either the client or the server in order to implement the
+protocol. There is always an in-band, message-passing fallback mechanism.
+
+Overview
+========
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the device-specific kernel driver does not drive the
+device at all.  Typically, the user space process is a VMM and the device is
+passed through to it in order to achieve high performance. VFIO provides an API
+and the required functionality in the kernel. QEMU has adopted VFIO to allow a
+guest to directly access physical devices, instead of emulating them in
+software.
+
+vfio-user reuses the core VFIO concepts defined in its API, but implements them
+as messages to be sent over a socket. It does not change the kernel-based VFIO
+in any way; in fact, none of the VFIO kernel modules need to be loaded to use
+vfio-user. It is also possible for the client to concurrently use the current
+kernel-based VFIO for one device, and vfio-user for another device.
+
+VFIO Device Model
+-----------------
+
+A device under VFIO presents a standard interface to the user process. Many of
+the VFIO operations in the existing interface use the ``ioctl()`` system call, and
+references to the existing interface are called the ``ioctl()`` implementation in
+this document.
+
+The following sections describe the set of messages that implement the vfio-user
+interface over a socket. In many cases, the messages are analogous to data
+structures used in the ``ioctl()`` implementation. Messages derived from the
+``ioctl()`` will have a name derived from the ``ioctl()`` command name.  E.g., the
+``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
+``VFIO_USER_DEVICE_GET_INFO`` message.  The purpose of this reuse is to share as
+much code as feasible with the ``ioctl()`` implementation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+
+After the client connects to the server, the initial client message is
+``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
+apply to the session. The server replies with a compatible version and set of
+capabilities it supports, or closes the connection if it cannot support the
+advertised version.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
+information about the device. This information includes:
+
+* the device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
+* the number of device regions, and
+* the number of interrupt types the device supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
+server for information about the device's regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
+* Region index, size, and offset.
+
+When a device region can be mapped by the client, the server provides a file
+descriptor which the client can ``mmap()``. The server is responsible for
+polling for client updates to memory mapped regions.
+
+Region Capabilities
+"""""""""""""""""""
+
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space.
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part. When only a subset of a
+region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
+capability is included in the region info reply. This capability describes
+which portions can be mapped by the client.
+
+.. Note::
+   For example, in a virtual NVMe controller, sparse regions can be used so
+   that accesses to the NVMe registers (found in the beginning of BAR0) are
+   trapped (an infrequent event), while allowing direct access to the doorbells
+   (an extremely frequent event as every I/O submission requires a write to
+   BAR0), found in the next page after the NVMe registers in BAR0.
+
+Device-Specific Regions
+"""""""""""""""""""""""
+
+A device can define regions additional to the standard ones (e.g. PCI indexes
+0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
+in the region info reply of a device-specific region. Such regions are reflected
+in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
+value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.
+
+Region I/O via file descriptors
+-------------------------------
+
+For unmapped regions, region I/O from the client is done via
+``VFIO_USER_REGION_READ/WRITE``.  As an optimization, ioeventfds or ioregionfds
+may be configured for sub-regions of some regions. A client may request
+information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
+configuring the returned file descriptors as ioeventfds or ioregionfds, the
+server can be directly notified of I/O (for example, by KVM) without taking a
+trip through the client.
+
+Interrupts
+^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
+for the device's interrupt types. The interrupt types are specific to the bus
+the device is attached to, and the client is expected to know the capabilities
+of each interrupt type. The server can signal an interrupt by directly injecting
+interrupts into the guest via an event file descriptor. The client configures
+how the server signals an interrupt with ``VFIO_USER_SET_IRQS`` messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+
+When the guest executes load or store operations to an unmapped device region,
+the client forwards these operations to the server with
+``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
+will reply with data from the device on read operations or an acknowledgement on
+write operations. See `Read and Write Operations`_.
+
+Client memory access
+--------------------
+
+The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
+inform the server of the valid DMA ranges that the server can access on behalf
+of a device (typically, VM guest memory). DMA memory may be accessed by the
+server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
+socket. In this case, the "DMA" part of the naming is a misnomer.
+
+Actual direct memory access of client memory from the server is possible if the
+client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
+privileges cannot be revoked by the client, therefore file descriptors should
+only be exported in environments where the client trusts the server not to
+corrupt guest memory.
+
+See `Read and Write Operations`_.
+
+Client/server interactions
+==========================
+
+Socket
+------
+
+A server can serve:
+
+1) one or more clients, and/or
+2) one or more virtual devices, belonging to one or more clients.
+
+The current protocol specification requires a dedicated socket per
+client/server connection. It is a server-side implementation detail whether a
+single server handles multiple virtual devices from the same or multiple
+clients. The location of the socket is implementation-specific. Multiplexing
+clients, devices, and servers over the same socket is not supported in this
+version of the protocol.
+
+Authentication
+--------------
+
+For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types that span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Command Concurrency
+-------------------
+
+A client may pipeline multiple commands without waiting for previous command
+replies.  The server will process commands in the order they are received.  A
+consequence of this is that if a client issues a command with the *No_reply*
+bit, then subsequently issues a command without *No_reply*, the older command
+will have been processed before the reply to the younger command is sent by the
+server.  The client must be aware of the device's capability to process
+concurrent commands if pipelining is used.  For example, pipelining allows
+multiple client threads to concurrently access device regions; the client must
+ensure these accesses obey device semantics.
+
+An example is a frame buffer device, where the device may allow concurrent
+access to different areas of video memory, but may have indeterminate behavior
+if concurrent accesses are performed to command or status registers.
+
+Note that unrelated messages sent from the server to the client can appear
+between a client-to-server request and its reply, and vice versa.
+
+Implementers should be prepared for certain commands to exhibit potentially
+unbounded latencies.  For example, ``VFIO_USER_DEVICE_RESET`` may take an
+arbitrarily long time to complete; clients should take care not to block
+unnecessarily.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly. Both the client and the server need to know how to handle such
+events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+   device update) or unintentionally (e.g. because of a crash).
+2) A virtual device has been shut down with no intention to be restarted.
+
+It is impossible for the client to know whether or not a failure is
+intermittent or innocuous and should be retried, therefore the client should
+reset the VFIO device when it detects the socket has been disconnected.
+Error recovery will be driven by the guest's device error handling
+behavior.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the client
+has exited. Currently, this means that the guest is shut down, so the device is
+no longer needed and the server can automatically exit. However, there
+can be cases where a client disconnection should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which is not yet
+   implemented.
+
+Therefore in order for the protocol to be forward compatible, the server should
+respond to a client disconnection as follows:
+
+ - all client memory regions are unmapped and cleaned up (including closing any
+   passed file descriptors)
+ - all IRQ file descriptors passed from the old client are closed
+ - the device state should otherwise be retained
+
+The expectation is that when a client reconnects, it will re-establish IRQ and
+client memory mappings.
+
+If anything happens to the client (such as QEMU actually exiting), the control
+stack will know about it and can clean up resources accordingly.
+
+Security Considerations
+-----------------------
+
+Speaking generally, vfio-user clients should not trust servers, and vice versa.
+Standard tools and mechanisms should be used on both sides to validate input and
+protect against denial-of-service scenarios, buffer overflows, etc.
+
+Request Retry and Response Timeout
+----------------------------------
+A failed command is a command that has been successfully sent and has been
+responded to with an error code. Failure to send the command in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+.. Note::
+   QEMU's VFIO retries certain operations if they fail. While this makes sense
+   for real HW, we don't know for sure whether it makes sense for virtual
+   devices.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+Message sizes
+-------------
+
+Some requests have an ``argsz`` field. In a request, it defines the maximum
+expected reply payload size, which should be at least the size of the fixed
+reply payload headers defined here. The *request* payload size is defined by the
+usual ``msg_size`` field in the header, not the ``argsz`` field.
+
+In a reply, the server sets the ``argsz`` field to the size needed for the full
+payload. This may be less than the requested maximum size. This may be
+larger than the requested maximum size: in that case, the full payload is not
+included in the reply, but the ``argsz`` field in the reply indicates the needed
+size, allowing a client to allocate a larger buffer for holding the reply before
+trying again.
+
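The ``argsz`` retry behaviour described above can be sketched as follows. This is a minimal illustration only, not part of the specification: ``send`` is a hypothetical transport callback that issues one request advertising the given ``argsz`` and returns the ``argsz`` found in the reply together with the reply payload, and ``max_argsz`` is a local sanity limit that the protocol does not define.

```python
from typing import Callable, Tuple

# Hypothetical transport hook: sends one request advertising `argsz` and
# returns (argsz taken from the reply, reply payload bytes).
SendFn = Callable[[int], Tuple[int, bytes]]


def request_with_argsz(send: SendFn, initial_argsz: int,
                       max_argsz: int = 1 << 20) -> bytes:
    """Repeat a request until the reply payload fits in the advertised argsz."""
    argsz = initial_argsz
    while True:
        needed, payload = send(argsz)
        if needed <= argsz:
            return payload      # the full payload was included in the reply
        if needed > max_argsz:  # local limit, an assumption of this sketch
            raise ValueError(f"reply needs {needed} bytes, limit is {max_argsz}")
        argsz = needed          # allocate a larger buffer and try again
```

A client would typically start with the size of the fixed reply structure and let the loop grow the buffer only when the server reports a larger ``argsz``.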
+In addition, during negotiation (see `Version`_), the client and server may
+each specify a ``max_data_xfer_size`` value; this defines the maximum data that
+may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
+messages; see `Read and Write Operations`_.
+
+Protocol Specification
+======================
+
+To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
+with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
+little-endian format, although this may be relaxed in future revisions in cases
+where the client and server are both big-endian.
+
+Unless otherwise specified, all sizes should be presumed to be in bytes.
+
+.. _Commands:
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message command is sent from the client or the server.
+
+======================================  =========  =================
+Name                                    Command    Request Direction
+======================================  =========  =================
+``VFIO_USER_VERSION``                   1          client -> server
+``VFIO_USER_DMA_MAP``                   2          client -> server
+``VFIO_USER_DMA_UNMAP``                 3          client -> server
+``VFIO_USER_DEVICE_GET_INFO``           4          client -> server
+``VFIO_USER_DEVICE_GET_REGION_INFO``    5          client -> server
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``  6          client -> server
+``VFIO_USER_DEVICE_GET_IRQ_INFO``       7          client -> server
+``VFIO_USER_DEVICE_SET_IRQS``           8          client -> server
+``VFIO_USER_REGION_READ``               9          client -> server
+``VFIO_USER_REGION_WRITE``              10         client -> server
+``VFIO_USER_DMA_READ``                  11         server -> client
+``VFIO_USER_DMA_WRITE``                 12         server -> client
+``VFIO_USER_DEVICE_RESET``              13         client -> server
+``VFIO_USER_DIRTY_PAGES``               14         client -> server
+======================================  =========  =================
+
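For illustration, the command table above could be mirrored in code as an enumeration. The sketch below is an assumption about how an implementation might organise the IDs; the names ``VfioUserCommand`` and ``request_direction`` are hypothetical, while the numeric values and directions are taken from the table.

```python
from enum import IntEnum


class VfioUserCommand(IntEnum):
    """Command IDs as listed in the table (class name is hypothetical)."""
    VERSION = 1
    DMA_MAP = 2
    DMA_UNMAP = 3
    DEVICE_GET_INFO = 4
    DEVICE_GET_REGION_INFO = 5
    DEVICE_GET_REGION_IO_FDS = 6
    DEVICE_GET_IRQ_INFO = 7
    DEVICE_SET_IRQS = 8
    REGION_READ = 9
    REGION_WRITE = 10
    DMA_READ = 11
    DMA_WRITE = 12
    DEVICE_RESET = 13
    DIRTY_PAGES = 14


# Only the two DMA access commands are requests issued by the server.
SERVER_TO_CLIENT = frozenset({VfioUserCommand.DMA_READ, VfioUserCommand.DMA_WRITE})


def request_direction(cmd: VfioUserCommand) -> str:
    """Return the request direction in the notation used by the table."""
    return "server -> client" if cmd in SERVER_TO_CLIENT else "client -> server"
```

Keeping the direction next to the ID lets a receiver reject a command that arrives from the wrong side of the connection.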
+Header
+------
+
+All messages, both command messages and reply messages, are preceded by a
+16-byte header that contains basic information about the message. The header is
+followed by message-specific data described in the sections below.
+
++----------------+--------+-------------+
+| Name           | Offset | Size        |
++================+========+=============+
+| Message ID     | 0      | 2           |
++----------------+--------+-------------+
+| Command        | 2      | 2           |
++----------------+--------+-------------+
+| Message size   | 4      | 4           |
++----------------+--------+-------------+
+| Flags          | 8      | 4           |
++----------------+--------+-------------+
+|                | +-----+------------+ |
+|                | | Bit | Definition | |
+|                | +=====+============+ |
+|                | | 0-3 | Type       | |
+|                | +-----+------------+ |
+|                | | 4   | No_reply   | |
+|                | +-----+------------+ |
+|                | | 5   | Error      | |
+|                | +-----+------------+ |
++----------------+--------+-------------+
+| Error          | 12     | 4           |
++----------------+--------+-------------+
+| <message data> | 16     | variable    |
++----------------+--------+-------------+
+
+* *Message ID* identifies the message, and is echoed in the command's reply
+  message. Message IDs belong entirely to the sender, can be re-used (even
+  concurrently) and the receiver must not make any assumptions about their
+  uniqueness.
+* *Command* specifies the command to be executed, listed in Commands_. It is
+  also set in the reply header.
+* *Message size* contains the size of the entire message, including the header.
+* *Flags* contains attributes of the message:
+
+  * The *Type* bits indicate the message type.
+
+    *  *Command* (value 0x0) indicates a command message.
+    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
+       command with the same message ID.
+  * *No_reply* in a command message indicates that no reply is needed for this
+    command.  This is commonly used when multiple commands are sent, and only
+    the last needs acknowledgement.
+  * *Error* in a reply message indicates the command being acknowledged had
+    an error. In this case, the *Error* field will be valid.
+
+* *Error* in a reply message is an optional UNIX errno value. It may be zero
+  even if the Error bit is set in Flags. It is reserved in a command message.
+
+Each command message in Commands_ must be replied to with a reply message,
+unless the message sets the *No_reply* bit.  The reply consists of the header
+with the *Reply* bit set, plus any additional data.
+
+If an error occurs, the reply message must only include the reply header.
+
+As the header is standard in both requests and replies, it is not included in
+the command-specific specifications below; each message definition should be
+appended to the standard header, and the offsets are given from the end of the
+standard header.
+
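The 16-byte header layout above can be expressed with Python's ``struct`` module. This is a sketch under the stated little-endian assumption; the helper names (``pack_message``, ``unpack_header``, ``pack_error_reply``) are hypothetical and not defined by the specification.

```python
import struct
from typing import NamedTuple

# Little-endian: Message ID (u16), Command (u16), Message size (u32),
# Flags (u32), Error (u32) -> 16 bytes in total.
HEADER = struct.Struct("<HHIII")

TYPE_MASK = 0xF          # Flags bits 0-3
TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1
FLAG_NO_REPLY = 1 << 4   # Flags bit 4
FLAG_ERROR = 1 << 5      # Flags bit 5


class Header(NamedTuple):
    msg_id: int
    command: int
    msg_size: int
    flags: int
    error: int


def pack_message(msg_id: int, command: int, payload: bytes = b"",
                 flags: int = TYPE_COMMAND, error: int = 0) -> bytes:
    """Prepend the header; Message size covers the header plus the payload."""
    size = HEADER.size + len(payload)
    return HEADER.pack(msg_id, command, size, flags, error) + payload


def unpack_header(data: bytes) -> Header:
    """Decode the first 16 bytes of a message."""
    return Header(*HEADER.unpack_from(data))


def pack_error_reply(request: Header, errno_value: int) -> bytes:
    """An error reply echoes the ID and command and carries only the header."""
    return pack_message(request.msg_id, request.command, b"",
                        TYPE_REPLY | FLAG_ERROR, errno_value)
```

A receiver would read 16 bytes, decode them with ``unpack_header``, and then read ``msg_size - 16`` further bytes of message data.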
+``VFIO_USER_VERSION``
+---------------------
+
+.. _Version:
+
+This is the initial message sent by the client after the socket connection is
+established; the same format is used for the server's reply.
+
+Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
+message proposing a protocol version and a set of capabilities. The server
+compares these with the versions and capabilities it supports and sends a
+``VFIO_USER_VERSION`` reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the server
+  does not support the proposed major, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+  version proposed.
+* The capability list must be a subset of those proposed. If the server
+  requires a capability the client did not include, it closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format. The minor version may change
+when compatible changes are made, such as adding new messages or capabilities.
+Both the client and server must support all minor versions less than the
+maximum minor version they support. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
+
+When making a change to this specification, the protocol version number must
+be included in the form "added in version X.Y".
+
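One possible server-side reading of the negotiation rules above is sketched below. The function name ``negotiate`` and its parameters are hypothetical, and the choice to answer each common capability with the server's own value is an assumption of this sketch rather than something the text mandates.

```python
from typing import Any, Dict, FrozenSet, Optional, Tuple


def negotiate(proposed_major: int, proposed_minor: int,
              proposed_caps: Dict[str, Any],
              server_major: int, server_max_minor: int,
              server_caps: Dict[str, Any],
              required_caps: FrozenSet[str] = frozenset(),
              ) -> Optional[Tuple[int, int, Dict[str, Any]]]:
    """Return (major, minor, capabilities) for the reply, or None when the
    server should close the connection instead of replying."""
    if proposed_major != server_major:
        return None                       # major version must match exactly
    if not required_caps <= set(proposed_caps):
        return None                       # a required capability was not offered
    minor = min(proposed_minor, server_max_minor)
    # The reply may only name capabilities the client proposed.
    caps = {name: server_caps[name] for name in proposed_caps if name in server_caps}
    return proposed_major, minor, caps
```

Because both sides must support every minor version below their maximum, taking the minimum of the two minors always yields a version both can speak.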
+Request
+^^^^^^^
+
+==============  ======  ====
+Name            Offset  Size
+==============  ======  ====
+version major   0       2
+version minor   2       2
+version data    4       variable (including terminating NUL). Optional.
+==============  ======  ====
+
+The version data is an optional UTF-8 encoded JSON byte array with the following
+format:
+
++--------------+--------+-----------------------------------+
+| Name         | Type   | Description                       |
++==============+========+===================================+
+| capabilities | object | Contains common capabilities that |
+|              |        | the sender supports. Optional.    |
++--------------+--------+-----------------------------------+
+
+Capabilities:
+
++--------------------+--------+------------------------------------------------+
+| Name               | Type   | Description                                    |
++====================+========+================================================+
+| max_msg_fds        | number | Maximum number of file descriptors that can be |
+|                    |        | received by the sender in one message.         |
+|                    |        | Optional. If not specified then the receiver   |
+|                    |        | must assume a value of ``1``.                  |
++--------------------+--------+------------------------------------------------+
+| max_data_xfer_size | number | Maximum ``count`` for data transfer messages;  |
+|                    |        | see `Read and Write Operations`_. Optional,    |
+|                    |        | with a default value of 1048576 bytes.         |
++--------------------+--------+------------------------------------------------+
+| migration          | object | Migration capability parameters. If missing    |
+|                    |        | then migration is not supported by the sender. |
++--------------------+--------+------------------------------------------------+
+
+The migration capability contains the following name/value pairs:
+
++--------+--------+-----------------------------------------------+
+| Name   | Type   | Description                                   |
++========+========+===============================================+
+| pgsize | number | Page size of dirty pages bitmap. The smaller  |
+|        |        | of the client and server values is used.      |
++--------+--------+-----------------------------------------------+
+
+Reply
+^^^^^
+
+The same message format is used in the server's reply with the semantics
+described above.
+
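The version payload described above (two 16-bit fields followed by optional NUL-terminated JSON) could be encoded and decoded as in the following sketch. The function names are hypothetical; the defaults applied on decode are the ones stated in the capabilities table.

```python
import json
import struct
from typing import Any, Dict, Optional, Tuple

VERSION_FIXED = struct.Struct("<HH")      # version major, version minor
DEFAULT_MAX_MSG_FDS = 1                   # default from the capabilities table
DEFAULT_MAX_DATA_XFER_SIZE = 1048576      # default from the capabilities table


def pack_version(major: int, minor: int,
                 capabilities: Optional[Dict[str, Any]] = None) -> bytes:
    """Build the payload that follows the standard 16-byte header."""
    body = VERSION_FIXED.pack(major, minor)
    if capabilities is not None:
        text = json.dumps({"capabilities": capabilities})
        body += text.encode("utf-8") + b"\0"   # version data is NUL-terminated
    return body


def unpack_version(payload: bytes) -> Tuple[int, int, Dict[str, Any]]:
    """Decode a payload and fill in defaults for omitted capabilities."""
    major, minor = VERSION_FIXED.unpack_from(payload)
    raw = payload[VERSION_FIXED.size:].rstrip(b"\0")
    caps = json.loads(raw.decode("utf-8")).get("capabilities", {}) if raw else {}
    caps.setdefault("max_msg_fds", DEFAULT_MAX_MSG_FDS)
    caps.setdefault("max_data_xfer_size", DEFAULT_MAX_DATA_XFER_SIZE)
    return major, minor, caps
```

Since the same format is used in both directions, a client and a server could share these two helpers.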
+``VFIO_USER_DMA_MAP``
+---------------------
+
+This command message is sent by the client to the server to inform it of the
+memory regions the server can access. It must be sent before the server can
+perform any DMA to the client. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added to the client,
+or if the client uses a vIOMMU.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++-------------+--------+-------------+
+| Name        | Offset | Size        |
++=============+========+=============+
+| argsz       | 0      | 4           |
++-------------+--------+-------------+
+| flags       | 4      | 4           |
++-------------+--------+-------------+
+|             | +-----+------------+ |
+|             | | Bit | Definition | |
+|             | +=====+============+ |
+|             | | 0   | readable   | |
+|             | +-----+------------+ |
+|             | | 1   | writeable  | |
+|             | +-----+------------+ |
++-------------+--------+-------------+
+| offset      | 8      | 8           |
++-------------+--------+-------------+
+| address     | 16     | 8           |
++-------------+--------+-------------+
+| size        | 24     | 8           |
++-------------+--------+-------------+
+
+* *argsz* is the size of the above structure. Note there is no reply payload,
+  so this field differs from other message types.
+* *flags* contains the following region attributes:
+
+  * *readable* indicates that the region can be read from.
+
+  * *writeable* indicates that the region can be written to.
+
+* *offset* is the file offset of the region with respect to the associated file
+  descriptor, or zero if the region is not mappable.
+* *address* is the base DMA address of the region.
+* *size* is the size of the region.
+
+This structure is 32 bytes in size, so the message size is 16 + 32 = 48 bytes.
+
+If the DMA region being added can be directly mapped by the server, a file
+descriptor must be sent as part of the message meta-data. The region can be
+mapped via the mmap() system call. On ``AF_UNIX`` sockets, the file descriptor
+must be passed as ``SCM_RIGHTS`` type ancillary data.  Otherwise, if the DMA
+region cannot be directly mapped by the server, a file descriptor must not be
+sent as part of the message meta-data and the DMA region can be accessed by the
+server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
+explained in `Read and Write Operations`_. A command to map over an existing
+region must be failed by the server with ``EEXIST`` set in the error field in
+the reply.
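+
+The ``SCM_RIGHTS`` mechanism mentioned above can be sketched in Python as
+follows. This is only an illustration of passing one file descriptor next to
+some message bytes on an ``AF_UNIX`` socket; the helper names ``send_with_fd``
+and ``recv_with_fd`` are hypothetical, and the bytes sent are a placeholder
+rather than a real vfio-user message.
+
```python
import array
import os
import socket


def send_with_fd(sock, data, fd):
    """Send data with one file descriptor attached as SCM_RIGHTS ancillary data."""
    anc = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]).tobytes())]
    return sock.sendmsg([data], anc)


def recv_with_fd(sock, bufsize):
    """Receive data and return it with any file descriptors that came along."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    return data, list(fds)


# Demo over a socketpair: pass the read end of a pipe to the peer.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pipe_r, pipe_w = os.pipe()
send_with_fd(left, b"map", pipe_r)
data, fds = recv_with_fd(right, 16)
```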
+
+Reply
+^^^^^
+
+There is no payload in the reply message.
+
+``VFIO_USER_DMA_UNMAP``
+-----------------------
+
+This command message is sent by the client to the server to inform it that a
+DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
+message, is no longer available for DMA. It typically occurs when memory is
+removed from the client or if the client uses a vIOMMU. The DMA region is
+described by the following structure:
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++--------------+--------+------------------------+
+| Name         | Offset | Size                   |
++==============+========+========================+
+| argsz        | 0      | 4                      |
++--------------+--------+------------------------+
+| flags        | 4      | 4                      |
++--------------+--------+------------------------+
+|              | +-----+-----------------------+ |
+|              | | Bit | Definition            | |
+|              | +=====+=======================+ |
+|              | | 0   | get dirty page bitmap | |
+|              | +-----+-----------------------+ |
++--------------+--------+------------------------+
+| address      | 8      | 8                      |
++--------------+--------+------------------------+
+| size         | 16     | 8                      |
++--------------+--------+------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *flags* contains the following DMA region attributes:
+
+  * *get dirty page bitmap* indicates that a dirty page bitmap must be
+    populated before unmapping the DMA region. The client must provide a
+    `VFIO Bitmap`_ structure, explained below, immediately following this
+    entry.
+
+* *address* is the base DMA address of the DMA region.
+* *size* is the size of the DMA region.
+
+The address and size of the DMA region being unmapped must match exactly a
+previous mapping. The size of the request message depends on whether or not the
+*get dirty page bitmap* bit is set in *flags*:
+
+* If not set, the size of the total request message is: 16 + 24.
+
+* If set, the size of the total request message is: 16 + 24 + 16.
+
+.. _VFIO Bitmap:
+
+VFIO Bitmap Format
+""""""""""""""""""
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| pgsize | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+
+* *pgsize* is the page size for the bitmap, in bytes.
+* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
+
+Reply
+^^^^^
+
+Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
+mapped then the server must release all references to that DMA region before
+replying, which potentially includes in-flight DMA transactions.
+
+The server responds with the original DMA entry in the request. If the
+*get dirty page bitmap* bit is set in flags in the request, then
+the server also includes the `VFIO Bitmap`_ structure sent in the request,
+followed by the corresponding dirty page bitmap, where each bit represents
+one page of size *pgsize* in `VFIO Bitmap`_ .
+
+The total size of the reply message is:
+16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set).
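+
+The size arithmetic for the request and reply can be written out as a small
+Python sketch. The constant and function names are hypothetical; the numbers
+are the ones quoted in this section (a 16-byte message header, a 24-byte DMA
+unmap structure and a 16-byte VFIO bitmap header).
+
```python
HEADER_SIZE = 16       # message header size used in the sums above
DMA_UNMAP_SIZE = 24    # argsz, flags, address, size
VFIO_BITMAP_SIZE = 16  # pgsize, size


def unmap_request_size(get_dirty_bitmap):
    """Total request size, with the bitmap header only if the flag is set."""
    extra = VFIO_BITMAP_SIZE if get_dirty_bitmap else 0
    return HEADER_SIZE + DMA_UNMAP_SIZE + extra


def unmap_reply_size(get_dirty_bitmap, bitmap_bytes=0):
    """Total reply size; bitmap_bytes is the *size* field of the VFIO bitmap."""
    extra = VFIO_BITMAP_SIZE + bitmap_bytes if get_dirty_bitmap else 0
    return HEADER_SIZE + DMA_UNMAP_SIZE + extra
```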
+
+``VFIO_USER_DEVICE_GET_INFO``
+-----------------------------
+
+This command message is sent by the client to the server to query for basic
+information about the device.
+
+Request
+^^^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 0      | 4                        |
++-------------+--------+--------------------------+
+| flags       | 4      | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8      | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 12     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* all other fields must be zero.
+
+Reply
+^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 0      | 4                        |
++-------------+--------+--------------------------+
+| flags       | 4      | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8      | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 12     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* contains the following device attributes.
+
+  * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
+    ``VFIO_USER_DEVICE_RESET`` message.
+  * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
+
+* *num_regions* is the number of memory regions that the device exposes.
+* *num_irqs* is the number of distinct interrupt types that the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions.
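+
+A client might decode the 16-byte reply along the following lines. This is a
+minimal Python sketch: the flag bit positions come from the table above, while
+the little-endian byte order and the ``parse_device_info`` helper are
+assumptions made for the example.
+
```python
import struct

# argsz, flags, num_regions, num_irqs: four 4-byte fields.
# Little-endian ("<") is an assumption made for this sketch.
DEVICE_INFO_FMT = "<IIII"
VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1


def parse_device_info(payload):
    """Decode the reply payload into a dictionary of named values."""
    argsz, flags, num_regions, num_irqs = struct.unpack_from(DEVICE_INFO_FMT, payload)
    return {
        "argsz": argsz,
        "supports_reset": bool(flags & VFIO_DEVICE_FLAGS_RESET),
        "is_pci": bool(flags & VFIO_DEVICE_FLAGS_PCI),
        "num_regions": num_regions,
        "num_irqs": num_irqs,
    }


# Example reply with made-up region and IRQ counts.
reply = struct.pack(DEVICE_INFO_FMT, 16, VFIO_DEVICE_FLAGS_PCI, 9, 5)
info = parse_device_info(reply)
```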
+
+``VFIO_USER_DEVICE_GET_REGION_INFO``
+------------------------------------
+
+This command message is sent by the client to the server to query for
+information about device regions. The VFIO region info structure is defined in
+``<linux/vfio.h>`` (``struct vfio_region_info``).
+
+Request
+^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the maximum size of the reply payload
+* *index* is the index of the memory region being queried; it is the only
+  field that is required to be set in the command message.
+* all other fields must be zero.
+
+Reply
+^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the size required for the full reply payload (region info structure
+  plus the size of any region capabilities)
+* *flags* are attributes of the region:
+
+  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
+  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
+  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region.
+    When this flag is set, the reply will include a file descriptor in its
+    meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as
+    ``SCM_RIGHTS`` type ancillary data.
+  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
+    reply.
+
+* *index* is the index of the memory region being queried; it is the only
+  field that is required to be set in the command message.
+* *cap_offset* describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points to is a VFIO cap header defined in
+  ``<linux/vfio.h>``.
+* *size* is the size of the region.
+* *offset* is the offset that should be given to the mmap() system call for
+  regions with the MMAP attribute. It is also used as the base offset when
+  mapping a VFIO sparse mmap area, described below.
+
+VFIO region capabilities
+""""""""""""""""""""""""
+
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list - each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
+vfio_info_cap_header``).
+
+VFIO cap header format
+""""""""""""""""""""""
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* *id* is the capability identity.
+* *version* is a capability-specific version number.
+* *next* specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
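+
+Walking the capability list could look like the Python sketch below. It makes
+several assumptions that are not spelled out in this section: little-endian
+byte order, a *next* value of zero marking the end of the list (the convention
+used by the Linux VFIO headers), and a defensive check that the chain only
+moves forward. The ``walk_caps`` helper and the capability ids in the example
+buffer are purely illustrative.
+
```python
import struct

REGION_INFO_FMT = "<IIIIQQ"  # argsz, flags, index, cap_offset, size, offset
CAP_HEADER_FMT = "<HHI"      # id, version, next
VFIO_REGION_INFO_FLAG_CAPS = 1 << 3


def walk_caps(reply):
    """Return (offset, id, version) for each capability found in the reply."""
    _argsz, flags, _index, cap_offset, _size, _offset = struct.unpack_from(
        REGION_INFO_FMT, reply)
    if not flags & VFIO_REGION_INFO_FLAG_CAPS:
        return []
    caps = []
    off = cap_offset
    while off != 0:  # assumption: next == 0 terminates the list
        cap_id, version, nxt = struct.unpack_from(CAP_HEADER_FMT, reply, off)
        caps.append((off, cap_id, version))
        if nxt != 0 and nxt <= off:
            raise ValueError("capability chain does not move forward")
        off = nxt  # relative to the start of the region info structure
    return caps


# Example buffer: region info, then two capabilities with arbitrary ids.
info = struct.pack(REGION_INFO_FMT, 56, VFIO_REGION_INFO_FLAG_CAPS, 0, 32, 0x1000, 0)
cap_a = struct.pack(CAP_HEADER_FMT, 2, 1, 48) + bytes(8)  # header + 8-byte body
cap_b = struct.pack(CAP_HEADER_FMT, 1, 1, 0)              # last entry
caps = walk_caps(info + cap_a + cap_b)
```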
+
+VFIO sparse mmap cap header
+"""""""""""""""""""""""""""
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             | <next>                           |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region info cap sparse mmap
+""""""""""""""""""""""""""""""""
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 8    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* *nr_areas* is the number of sparse mmap areas in the region.
+* *offset* and *size* describe a single area that can be mapped by the client.
+  There will be *nr_areas* pairs of offset and size. The offset will be added to
+  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` to form the
+  offset argument of the subsequent mmap() call.
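+
+The offset calculation described above can be sketched in Python as follows.
+The helper name ``sparse_mmap_areas`` is hypothetical, little-endian byte order
+is assumed, and ``cap_body`` stands for the bytes that follow the 8-byte VFIO
+cap header.
+
```python
import struct

SPARSE_HDR_FMT = "<II"  # nr_areas, reserved
AREA_FMT = "<QQ"        # offset, size


def sparse_mmap_areas(cap_body, region_offset):
    """Return (mmap_offset, size) for each area in the capability body."""
    nr_areas, _reserved = struct.unpack_from(SPARSE_HDR_FMT, cap_body, 0)
    pos = struct.calcsize(SPARSE_HDR_FMT)
    areas = []
    for _ in range(nr_areas):
        area_offset, area_size = struct.unpack_from(AREA_FMT, cap_body, pos)
        # The area offset is added to the region's base offset for mmap().
        areas.append((region_offset + area_offset, area_size))
        pos += struct.calcsize(AREA_FMT)
    return areas


# Example body with two made-up areas.
body = (struct.pack(SPARSE_HDR_FMT, 2, 0)
        + struct.pack(AREA_FMT, 0x0, 0x1000)
        + struct.pack(AREA_FMT, 0x2000, 0x1000))
areas = sparse_mmap_areas(body, region_offset=0x10000)
```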
+
+The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region type cap header
+"""""""""""""""""""""""""""
+
++------------------+---------------------------+
+| Name             | Value                     |
++==================+===========================+
+| id               | VFIO_REGION_INFO_CAP_TYPE |
++------------------+---------------------------+
+| version          | 0x1                       |
++------------------+---------------------------+
+| next             | <next>                    |
++------------------+---------------------------+
+| region info type | VFIO region info type     |
++------------------+---------------------------+
+
+This capability is defined when a region is specific to the device.
+
+VFIO region info type cap
+"""""""""""""""""""""""""
+
+The VFIO region info type is defined in ``<linux/vfio.h>``
+(``struct vfio_region_info_cap_type``).
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| type    | 0      | 4    |
++---------+--------+------+
+| subtype | 4      | 4    |
++---------+--------+------+
+
+The only device-specific region type and subtype supported by vfio-user are
+``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
+--------------------------------------
+
+Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by
+``mmap()`` of a file descriptor provided by the server.
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via
+file descriptors. This is an optional feature intended for performance
+improvements where an underlying sub-system (such as KVM) supports communication
+across such file descriptors to the vfio-user server, without needing to
+round-trip through the client.
+
+The server returns an array of sub-regions for the requested region. Each
+sub-region describes a span (offset and size) of a region, along with the
+requested file descriptor notification mechanism to use.  Each sub-region in the
+response message may choose to use a different method, as defined below.  The
+two mechanisms supported in this specification are ioeventfds and ioregionfds.
+
+The server in addition returns a file descriptor in the ancillary data; clients
+are expected to configure each sub-region's file descriptor with the requested
+notification method. For example, a client could configure KVM with the
+requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.
+
+Request
+^^^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+
+* *argsz* is the maximum size of the reply payload
+* *index* is the index of memory region being queried
+* all other fields must be zero
+
+The client must set ``flags`` to zero and specify the region being queried in
+the ``index``.
+
+Reply
+^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+| sub-regions | 16     | ...  |
++-------------+--------+------+
+
+* *argsz* is the size of the region IO FD info structure plus the
+  total size of the sub-region array. Thus, each array entry "i" is at offset
+  i * ((argsz - 16) / count) from the start of the array. Note that currently
+  each entry is 40 bytes for both IO FD types, but this is not to be relied on.
+  As elsewhere, this indicates the full reply payload size needed.
+* *flags* must be zero
+* *index* is the index of memory region being queried
+* *count* is the number of sub-regions in the array
+* *sub-regions* is the array of Sub-Region IO FD info structures
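+
+As an illustration, the reply could be split into sub-region entries as in the
+Python sketch below. It derives the entry stride from *argsz* instead of
+hard-coding 40 bytes, and takes the fixed fields before the array to be 16
+bytes, as laid out in the table above. The little-endian byte order and the
+``parse_io_fds`` helper are assumptions made for the example.
+
```python
import struct

IO_FDS_HDR_FMT = "<IIII"     # argsz, flags, index, count
# offset, size, fd_index, type, flags, padding, datamatch or user_data
SUB_REGION_FMT = "<QQIIIIQ"


def parse_io_fds(reply):
    """Return (index, entries), each entry being a tuple of sub-region fields."""
    argsz, _flags, index, count = struct.unpack_from(IO_FDS_HDR_FMT, reply)
    hdr = struct.calcsize(IO_FDS_HDR_FMT)
    if count == 0:
        return index, []
    stride = (argsz - hdr) // count  # size of one array entry
    if stride < struct.calcsize(SUB_REGION_FMT):
        raise ValueError("sub-region entries are smaller than expected")
    entries = []
    for i in range(count):
        entries.append(struct.unpack_from(SUB_REGION_FMT, reply, hdr + i * stride))
    return index, entries


# Example reply with two made-up entries sharing the FD at index 0.
entry_a = struct.pack(SUB_REGION_FMT, 0x0, 4, 0, 0, 0, 0, 0)
entry_b = struct.pack(SUB_REGION_FMT, 0x8, 4, 0, 0, 0, 0, 0)
reply = struct.pack(IO_FDS_HDR_FMT, 16 + 80, 0, 2, 2) + entry_a + entry_b
index, entries = parse_io_fds(reply)
```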
+
+The reply message will additionally include at least one file descriptor in the
+ancillary data. Note that more than one sub-region may share the same file
+descriptor.
+
+Note that it is the client's responsibility to verify the requested values (for
+example, that the requested offset does not exceed the region's bounds).
+
+Each sub-region given in the response has one of two possible structures,
+depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
+``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
+
+Sub-Region IO FD info format (ioeventfd)
+""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| datamatch | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region. This may be zero if the access size is
+  not relevant, which may allow for optimizations
+* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
+  notification; it may be shared.
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
+* *flags* is any of:
+
+  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
+  * ``KVM_IOEVENTFD_FLAG_PIO``
+  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)
+
+* *datamatch* is the datamatch value if needed
+
+See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
+KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
+
+Sub-Region IO FD info format (ioregionfd)
+"""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| user_data | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region. This may be zero if the access size is
+  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
+  must be set in *flags* in this case
+* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
+  messages; it may be shared
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
+* *flags* is any of:
+
+  * ``KVM_IOREGION_PIO``
+  * ``KVM_IOREGION_POSTED_WRITES``
+
+* *user_data* is an opaque value passed back to the server via a message on the
+  file descriptor
+
+For further information on the ioregionfd-specific fields, see:
+https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
+
+(FIXME: update with final API docs.)
+
+``VFIO_USER_DEVICE_GET_IRQ_INFO``
+---------------------------------
+
+This command message is sent by the client to the server to query for
+information about device interrupt types. The VFIO IRQ info structure is
+defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).
+
+Request
+^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the maximum size of the reply payload (16 bytes today)
+* *index* is the index of IRQ type being queried (e.g. ``VFIO_PCI_MSIX_IRQ_INDEX``)
+* all other fields must be zero
+
+Reply
+^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* defines IRQ attributes:
+
+  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
+    signalling.
+  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
+    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
+  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
+    triggered, and the client must send an ``UNMASK`` action to receive new
+    interrupts.
+  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS`` operations
+    set up interrupts as a set, and new sub-indexes cannot be enabled without
+    disabling the entire type.
+
+* *index* is the index of IRQ type being queried
+* *count* describes the number of interrupts of the queried type.
+
+``VFIO_USER_DEVICE_SET_IRQS``
+-----------------------------
+
+This command message is sent by the client to the server to set actions for
+device interrupt types. The VFIO IRQ set structure is defined in
+``<linux/vfio.h>`` (``struct vfio_irq_set``).
+
+Request
+^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 0      | 4                            |
++-------+--------+------------------------------+
+| flags | 4      | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 8      | 4                            |
++-------+--------+------------------------------+
+| start | 12     | 4                            |
++-------+--------+------------------------------+
+| count | 16     | 4                            |
++-------+--------+------------------------------+
+| data  | 20     | variable                     |
++-------+--------+------------------------------+
+
+* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
+  field. Note there is no reply payload, so this field differs from other
+  message types.
+* *flags* defines the action performed on the interrupt range. The ``DATA``
+  flags describe the data field sent in the message; the ``ACTION`` flags
+  describe the action to be performed. The flags are mutually exclusive for
+  both sets.
+
+  * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
+    The action is performed unconditionally.
+  * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
+    bytes. The action is performed if the corresponding boolean is true.
+  * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors
+    was sent in the message meta-data. These descriptors will be signalled when
+    the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the
+    descriptors are sent as ``SCM_RIGHTS`` type ancillary data.
+    If no file descriptors are provided, this de-assigns the specified
+    previously configured interrupts.
+  * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with
+    ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an interrupt,
+    or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks
+    the interrupt.
+  * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used
+    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an
+    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
+    guest unmasks the interrupt.
+  * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used
+    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an
+    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
+    server triggers the interrupt.
+
+* *index* is the index of IRQ type being set up.
+* *start* is the start of the sub-index being set.
+* *count* describes the number of sub-indexes being set. As a special case, a
+  count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables
+  all interrupts of the index.
+* *data* is an optional field included when the
+  ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans
+  that specify whether the action is to be performed on the corresponding
+  index. It's used when the action is only performed on a subset of the range
+  specified.
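+
+The request layout and flag rules above can be sketched in Python as follows.
+The flag bit positions follow the table; the little-endian byte order, the
+shortened constant names, the ``pack_set_irqs`` helper, and the rule that
+exactly one ``DATA`` and one ``ACTION`` flag is present are assumptions made
+for this example.
+
```python
import struct

IRQ_SET_FMT = "<IIIII"  # argsz, flags, index, start, count
DATA_NONE, DATA_BOOL, DATA_EVENTFD = 1 << 0, 1 << 1, 1 << 2
ACTION_MASK, ACTION_UNMASK, ACTION_TRIGGER = 1 << 3, 1 << 4, 1 << 5
DATA_FLAGS = (DATA_NONE, DATA_BOOL, DATA_EVENTFD)
ACTION_FLAGS = (ACTION_MASK, ACTION_UNMASK, ACTION_TRIGGER)


def pack_set_irqs(flags, index, start, count, bools=None):
    """Build the request payload; argsz covers the structure plus any data."""
    if sum(1 for f in DATA_FLAGS if flags & f) != 1:
        raise ValueError("exactly one DATA flag expected")
    if sum(1 for f in ACTION_FLAGS if flags & f) != 1:
        raise ValueError("exactly one ACTION flag expected")
    data = b""
    if flags & DATA_BOOL:
        if bools is None or len(bools) != count:
            raise ValueError("DATA_BOOL needs one boolean per sub-index")
        data = bytes(1 if b else 0 for b in bools)  # one byte per boolean
    argsz = struct.calcsize(IRQ_SET_FMT) + len(data)
    return struct.pack(IRQ_SET_FMT, argsz, flags, index, start, count) + data


# Mask sub-indexes 0 and 2 of IRQ index 2, leaving sub-index 1 untouched.
msg = pack_set_irqs(DATA_BOOL | ACTION_MASK, index=2, start=0, count=3,
                    bools=[True, False, True])
```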
+
+Not all interrupt types support every combination of data and action flags.
+The client must know the capabilities of the device and IRQ index before it
+sends a ``VFIO_USER_DEVICE_SET_IRQS`` message.
+
+In typical operation, a specific IRQ may operate as follows:
+
+1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along
+   with an eventfd. This associates the IRQ with a particular eventfd on the
+   server side.
+
+#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along
+   with another eventfd. This associates the given eventfd with the
+   mask/unmask state on the server side.
+
+#. The server may trigger the IRQ by writing 1 to the eventfd.
+
+#. The server may mask/unmask an IRQ, which writes 1 to the corresponding
+   mask/unmask eventfd, if there is one.
+
+5. A client may trigger a device IRQ itself, by sending a
+   ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``.
+
+6. A client may mask or unmask the IRQ, by sending a
+   ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``.
+
+Reply
+^^^^^
+
+There is no payload in the reply.
+
+.. _Read and Write Operations:
+
+Note that all of these operations must be supported by the client and/or server,
+even if the corresponding memory or device region has been shared as mappable.
+
+The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
+peer, for both reads and writes.
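+
+One way a sender could respect this limit is to break a larger access into
+several messages, as in the Python sketch below. The ``split_transfer`` helper
+is hypothetical, and whether splitting an access is acceptable for a given
+device (for example, for registers that require a single access) is outside
+the scope of this sketch.
+
```python
def split_transfer(offset, count, max_data_xfer_size):
    """Return (offset, count) pairs, none larger than max_data_xfer_size."""
    if max_data_xfer_size <= 0:
        raise ValueError("max_data_xfer_size must be positive")
    chunks = []
    done = 0
    while done < count:
        n = min(max_data_xfer_size, count - done)
        chunks.append((offset + done, n))
        done += n
    return chunks
```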
+
+``VFIO_USER_REGION_READ``
+-------------------------
+
+If a device region is not mappable, it's not directly accessible by the client
+via ``mmap()`` of the underlying file descriptor. In this case, a client can
+read from a device region with this message.
+
+Request
+^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+
+* *offset* into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred.
+
+Reply
+^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+| data   | 16     | variable |
++--------+--------+----------+
+
+* *offset* into the region accessed.
+* *region* is the index of the region accessed.
+* *count* is the size of the data transferred.
+* *data* is the data that was read from the device region.
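+
+A minimal Python sketch of the request and reply payloads for a region read is
+shown below. The little-endian byte order and the helper names
+``pack_region_read`` and ``parse_region_read_reply`` are assumptions made for
+the example, and the message header that precedes the payload is not shown.
+
```python
import struct

REGION_RW_FMT = "<QII"  # offset (8 bytes), region (4 bytes), count (4 bytes)


def pack_region_read(offset, region, count):
    """Build the request payload for a read of count bytes."""
    return struct.pack(REGION_RW_FMT, offset, region, count)


def parse_region_read_reply(payload):
    """Return (offset, region, data) from a reply payload."""
    offset, region, count = struct.unpack_from(REGION_RW_FMT, payload)
    start = struct.calcsize(REGION_RW_FMT)
    data = payload[start:start + count]
    if len(data) != count:
        raise ValueError("reply is shorter than its count field")
    return offset, region, data


request = pack_region_read(offset=0x10, region=0, count=4)
# A made-up reply echoing the request fields, followed by four data bytes.
reply = request + b"\xde\xad\xbe\xef"
result = parse_region_read_reply(reply)
```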
+
+``VFIO_USER_REGION_WRITE``
+--------------------------
+
+If a device region is not mappable, it's not directly accessible by the client
+via ``mmap()`` of the underlying file descriptor. In this case, a client can
+write to a device region with this message.
+
+Request
+^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+| data   | 16     | variable |
++--------+--------+----------+
+
+* *offset* is the offset into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred.
+* *data* is the data to write.
+
+Reply
+^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+
+* *offset* is the offset into the region accessed.
+* *region* is the index of the region accessed.
+* *count* is the size of the data transferred.
+
+``VFIO_USER_DMA_READ``
+-----------------------
+
+If the client has not shared mappable memory, the server can use this message to
+read from guest memory.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed. This address must have
+  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
+* *count* is the size of the data to be transferred.
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+| data    | 16     | variable |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed.
+* *count* is the size of the data transferred.
+* *data* is the data read.
+
+``VFIO_USER_DMA_WRITE``
+-----------------------
+
+If the client has not shared mappable memory, the server can use this message to
+write to guest memory.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+| data    | 16     | variable |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed. This address must have
+  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
+* *count* is the size of the data to be transferred.
+* *data* is the data to write.
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 4        |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed.
+* *count* is the size of the data transferred.
+
+``VFIO_USER_DEVICE_RESET``
+--------------------------
+
+This command message is sent from the client to the server to reset the device.
+Neither the request nor the reply has a payload.
+
+``VFIO_USER_DIRTY_PAGES``
+-------------------------
+
+This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``. It is sent by the client
+to the server in order to control logging of dirty pages, usually during a live
+migration.
+
+Dirty page tracking is optional for server implementations; clients should not
+rely on it.
+
+Request
+^^^^^^^
+
++-------+--------+-----------------------------------------+
+| Name  | Offset | Size                                    |
++=======+========+=========================================+
+| argsz | 0      | 4                                       |
++-------+--------+-----------------------------------------+
+| flags | 4      | 4                                       |
++-------+--------+-----------------------------------------+
+|       | +-----+----------------------------------------+ |
+|       | | Bit | Definition                             | |
+|       | +=====+========================================+ |
+|       | | 0   | VFIO_IOMMU_DIRTY_PAGES_FLAG_START      | |
+|       | +-----+----------------------------------------+ |
+|       | | 1   | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP       | |
+|       | +-----+----------------------------------------+ |
+|       | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
+|       | +-----+----------------------------------------+ |
++-------+--------+-----------------------------------------+
+
+* *argsz* is the size of the VFIO dirty bitmap info structure for
+  ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply payload.
+
+* *flags* defines the action to be performed by the server:
+
+  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start logging
+    pages it dirties. Logging continues until explicitly disabled by
+    ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``.
+
+  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logging
+    dirty pages.
+
+  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to return
+    the dirty bitmap for a specific IOVA range. The IOVA range is specified by
+    a "VFIO Bitmap Range" structure, which must immediately follow this
+    "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_.
+    This operation is only valid if logging of dirty pages has been previously
+    started.
+
+  These flags are mutually exclusive.
+
+This part of the request is analogous to VFIO's ``struct
+vfio_iommu_type1_dirty_bitmap``.
+
+.. _VFIO Bitmap Range Format:
+
+VFIO Bitmap Range Format
+""""""""""""""""""""""""
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| iova   | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+| bitmap | 16     | 24   |
++--------+--------+------+
+
+* *iova* is the IOVA offset.
+
+* *size* is the size of the IOVA region.
+
+* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_.
+
+This part of the request is analogous to VFIO's ``struct
+vfio_iommu_type1_dirty_bitmap_get``.
+
+Reply
+^^^^^
+
+For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or
+``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload.
+
+For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as follows:
+
++--------------+--------+-----------------------------------------+
+| Name         | Offset | Size                                    |
++==============+========+=========================================+
+| argsz        | 0      | 4                                       |
++--------------+--------+-----------------------------------------+
+| flags        | 4      | 4                                       |
++--------------+--------+-----------------------------------------+
+|              | +-----+----------------------------------------+ |
+|              | | Bit | Definition                             | |
+|              | +=====+========================================+ |
+|              | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
+|              | +-----+----------------------------------------+ |
++--------------+--------+-----------------------------------------+
+| bitmap range | 8      | 40                                      |
++--------------+--------+-----------------------------------------+
+| bitmap       | 48     | variable                                |
++--------------+--------+-----------------------------------------+
+
+* *argsz* is the size required for the full reply payload (dirty pages structure
+  + bitmap range structure + actual bitmap).
+* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``.
+* *bitmap range* is the same bitmap range struct provided in the request, as
+  defined in `VFIO Bitmap Range Format`_.
+* *bitmap* is the actual dirty pages bitmap corresponding to the range request.
+
+VFIO Device Migration Info
+--------------------------
+
+A device may contain a migration region (of type
+``VFIO_REGION_TYPE_MIGRATION``).  The beginning of the region must contain
+``struct vfio_device_migration_info``, defined in ``<linux/vfio.h>``. This
+subregion is accessed like any other part of a standard vfio-user region
+using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``.
+
++---------------+--------+-----------------------------+
+| Name          | Offset | Size                        |
++===============+========+=============================+
+| device_state  | 0      | 4                           |
++---------------+--------+-----------------------------+
+|               | +-----+----------------------------+ |
+|               | | Bit | Definition                 | |
+|               | +=====+============================+ |
+|               | | 0   | VFIO_DEVICE_STATE_RUNNING  | |
+|               | +-----+----------------------------+ |
+|               | | 1   | VFIO_DEVICE_STATE_SAVING   | |
+|               | +-----+----------------------------+ |
+|               | | 2   | VFIO_DEVICE_STATE_RESUMING | |
+|               | +-----+----------------------------+ |
++---------------+--------+-----------------------------+
+| reserved      | 4      | 4                           |
++---------------+--------+-----------------------------+
+| pending_bytes | 8      | 8                           |
++---------------+--------+-----------------------------+
+| data_offset   | 16     | 8                           |
++---------------+--------+-----------------------------+
+| data_size     | 24     | 8                           |
++---------------+--------+-----------------------------+
+
+* *device_state* defines the state of the device:
+
+  The client initiates device state transition by writing the intended state.
+  The server must respond only after it has successfully transitioned to the new
+  state. If an error occurs then the server must respond to the
+  ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingly and
+  must remain at the previous state, or in case of internal error it must
+  transition to the error state, defined as
+  ``VFIO_DEVICE_STATE_RESUMING | VFIO_DEVICE_STATE_SAVING``. The client must
+  re-read the device state in order to determine it afresh.
+
+  The following device states are defined:
+
+  +-----------+---------+----------+-----------------------------------+
+  | _RESUMING | _SAVING | _RUNNING | Description                       |
+  +===========+=========+==========+===================================+
+  | 0         | 0       | 0        | Device is stopped.                |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 0       | 1        | Device is running, default state. |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 1       | 0        | Stop-and-copy state               |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 1       | 1        | Pre-copy state                    |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 0       | 0        | Resuming                          |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 0       | 1        | Invalid state                     |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 1       | 0        | Error state                       |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 1       | 1        | Invalid state                     |
+  +-----------+---------+----------+-----------------------------------+
+
+  Valid state transitions are shown in the following table:
+
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming |
+  +=========================+=========+=========+===============+==========+==========+
+  | Stopped                 |    \-   |    1    |       0       |    0     |     0    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Running                 |    1    |    \-   |       1       |    1     |     1    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Stop-and-copy           |    1    |    1    |       \-      |    0     |     0    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Pre-copy                |    0    |    0    |       1       |    \-    |     0    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Resuming                |    0    |    1    |       0       |    0     |     \-   |
+  +-------------------------+---------+---------+---------------+----------+----------+
+
+  A device is migrated to the destination as follows:
+
+  * The source client transitions the device state from the running state to
+    the pre-copy state. This transition is optional for the client but must be
+    supported by the server. The source server starts sending device state data
+    to the source client through the migration region while the device is
+    running.
+
+  * The source client transitions the device state from the running state or the
+    pre-copy state to the stop-and-copy state. The source server stops the
+    device, saves device state and sends it to the source client through the
+    migration region.
+
+  The source client is responsible for sending the migration data to the
+  destination client.
+
+  A device is resumed on the destination as follows:
+
+  * The destination client transitions the device state from the running state
+    to the resuming state. The destination server uses the device state data
+    received through the migration region to resume the device.
+
+  * The destination client provides saved device state to the destination
+    server and then transitions the device back to the running state.
+
+* *reserved* This field is reserved and any access to it must be ignored by the
+  server.
+
+* *pending_bytes* Remaining bytes to be migrated by the server. This field is
+  read only.
+
+* *data_offset* Offset in the migration region where the client must:
+
+  * read from, during the pre-copy or stop-and-copy state, or
+
+  * write to, during the resuming state.
+
+  This field is read only.
+
+* *data_size* Contains the size, in bytes, of the amount of data copied to:
+
+  * the source migration region by the source server during the pre-copy or
+    stop-and-copy state, or
+
+  * the destination migration region by the destination client during the
+    resuming state.
+
+Device-specific data must be stored at any position after
+``struct vfio_device_migration_info``. Note that the migration region can be
+memory mappable, even partially. In practice, only the migration data portion
+can be memory mapped.
+
+The client processes device state data during the pre-copy and the
+stop-and-copy state in the following iterative manner:
+
+  1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads
+     of this field are idempotent. If there is no migration data to be
+     consumed then the next step depends on the current device state:
+
+     * pre-copy: the client must try again.
+
+     * stop-and-copy: this procedure can end and the device can now start
+       resuming on the destination.
+
+  2. The client reads ``data_offset``; at this point the server must make
+     available a portion of migration data at this offset to be read by the
+     client, which must happen *before* completing the read operation. The
+     amount of data to be read must be stored in the ``data_size`` field, which
+     the client reads next.
+
+  3. The client reads ``data_size`` to determine the amount of migration data
+     available.
+
+  4. The client reads and processes the migration data.
+
+  5. Go to step 1.
+
+Note that the client can transition the device from the pre-copy state to the
+stop-and-copy state at any time; ``pending_bytes`` does not need to become zero.
+
+The client initializes the device state on the destination by setting the
+device state to the resuming state and writing the migration data to the
+destination migration region at the ``data_offset`` offset. The client can
+write the source migration data in an iterative manner and the server must
+consume this data before completing each write operation, updating the
+``data_offset`` field. The server must apply the source migration data to the
+device while it is in the resuming state. The client must write data in the
+same order and with the same transaction size as it was read.
+
+If an error occurs then the server must fail the read or write operation. It is
+an implementation detail of the client how to handle errors.
+
+Appendices
+==========
+
+Unused VFIO ``ioctl()`` commands
+--------------------------------
+
+The following VFIO commands do not have an equivalent vfio-user command:
+
+* ``VFIO_GET_API_VERSION``
+* ``VFIO_CHECK_EXTENSION``
+* ``VFIO_SET_IOMMU``
+* ``VFIO_GROUP_GET_STATUS``
+* ``VFIO_GROUP_SET_CONTAINER``
+* ``VFIO_GROUP_UNSET_CONTAINER``
+* ``VFIO_GROUP_GET_DEVICE_FD``
+* ``VFIO_IOMMU_GET_INFO``
+
+However, once support for live migration for VFIO devices is finalized, some
+of the above commands may have to be handled by the client in their
+corresponding vfio-user form. This will be addressed in a future protocol
+version.
+
+VFIO groups and containers
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current VFIO implementation includes group and container idioms that
+describe how a device relates to the host IOMMU. In the vfio-user
+implementation, the IOMMU is implemented in software by the client, and is not
+visible to the server. The simplest approach is for the client to put each
+device into its own group and container.
+
+Backend Program Conventions
+---------------------------
+
+vfio-user backend program conventions are based on the vhost-user ones.
+
+* The backend program must not daemonize itself.
+* No assumptions must be made as to what access the backend program has on the
+  system.
+* File descriptors 0, 1 and 2 must exist, must have regular
+  stdin/stdout/stderr semantics, and can be redirected.
+* The backend program must honor the SIGTERM signal.
+* The backend program must accept the following command line options:
+
+  * ``--socket-path=PATH``: path to UNIX domain socket,
+  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
+    ``--socket-path``
+
+* The backend program must be accompanied by a JSON file stored under
+  ``/usr/share/vfio-user``.
+
+TODO add schema similar to docs/interop/vhost-user.json.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4256ad1adb..12d69f3a45 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1880,6 +1880,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s390x@nongnu.org
 
+vfio-user
+M: John G Johnson <john.g.johnson@oracle.com>
+M: Thanos Makatos <thanos.makatos@nutanix.com>
+S: Supported
+F: docs/devel/vfio-user.rst
+
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
-- 
2.25.1
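
The VFIO_USER_REGION_READ and VFIO_USER_REGION_WRITE payload tables in the
specification above can be sketched in C as follows. This is only an
illustration under stated assumptions: the struct name region_rw and the helper
region_read_request() are hypothetical, the vfio-user message header is
omitted, and fields are kept in host byte order. The static assertions check
that the struct matches the offsets in the tables, and the helper enforces the
rule that count must not exceed the peer's max_data_xfer_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Payload shared by region read/write requests and replies, as laid out in
 * the specification tables. Hypothetical name; message header not included.
 */
struct region_rw {
    uint64_t offset;  /* offset into the region being accessed */
    uint32_t region;  /* index of the region being accessed */
    uint32_t count;   /* size of the data to be transferred */
    /* data follows at offset 16 in a read reply or a write request */
};

_Static_assert(offsetof(struct region_rw, offset) == 0, "offset at 0");
_Static_assert(offsetof(struct region_rw, region) == 8, "region at 8");
_Static_assert(offsetof(struct region_rw, count) == 12, "count at 12");
_Static_assert(sizeof(struct region_rw) == 16, "data starts at 16");

/*
 * Fill in a read request. Returns 0 on success, or -1 when count exceeds
 * the peer's max_data_xfer_size (passed in as peer_max_xfer).
 */
static int region_read_request(struct region_rw *req, uint32_t region,
                               uint64_t offset, uint32_t count,
                               uint32_t peer_max_xfer)
{
    if (count > peer_max_xfer) {
        return -1;
    }
    req->offset = offset;
    req->region = region;
    req->count = count;
    return 0;
}
```

A caller would fill the struct, prepend the message header, and send both over
the socket; a read reply carries the same 16 bytes followed by count bytes of
data.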


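The device_state encoding and the transition table in the "VFIO Device
Migration Info" section above can be expressed as a small lookup. This is a
sketch, not part of the series: the enum mig_state, decode_state() and
transition_ok() are hypothetical names, the bit values follow the bit numbers
given in the device_state table, and treating any bits above bit 2 as invalid
is an assumption. The diagonal entries marked "-" in the table are encoded as
not allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the device_state table (bits 0, 1 and 2). */
#define STATE_RUNNING  (1u << 0)
#define STATE_SAVING   (1u << 1)
#define STATE_RESUMING (1u << 2)

/* The first five values follow the row/column order of the transition table. */
enum mig_state {
    MIG_STOPPED, MIG_RUNNING, MIG_STOP_AND_COPY, MIG_PRE_COPY, MIG_RESUMING,
    MIG_ERROR, MIG_INVALID,
};

/* Map the raw device_state field to one of the states defined in the spec. */
static enum mig_state decode_state(uint32_t device_state)
{
    if (device_state & ~7u) {
        return MIG_INVALID;     /* assumption: undefined bits are invalid */
    }
    switch (device_state) {
    case 0:                               return MIG_STOPPED;
    case STATE_RUNNING:                   return MIG_RUNNING;
    case STATE_SAVING:                    return MIG_STOP_AND_COPY;
    case STATE_SAVING | STATE_RUNNING:    return MIG_PRE_COPY;
    case STATE_RESUMING:                  return MIG_RESUMING;
    case STATE_RESUMING | STATE_SAVING:   return MIG_ERROR;
    default:                              return MIG_INVALID;
    }
}

/* True when the transition table lists from -> to as valid (entry "1"). */
static bool transition_ok(enum mig_state from, enum mig_state to)
{
    static const bool ok[5][5] = {
        /* to:            stop   run    s&c    pre    resume */
        /* stopped  */ { false, true,  false, false, false },
        /* running  */ { true,  false, true,  true,  true  },
        /* stop&copy*/ { true,  true,  false, false, false },
        /* pre-copy */ { false, false, true,  false, false },
        /* resuming */ { false, true,  false, false, false },
    };

    if (from > MIG_RESUMING || to > MIG_RESUMING) {
        return false;
    }
    return ok[from][to];
}
```

A server could run such a check before acting on a VFIO_USER_REGION_WRITE to
device_state and fail the write when the requested transition is not listed.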


* [PATCH RFC 02/19] vfio-user: add VFIO base abstract class
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 01/19] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Add an abstract base class both the kernel driver
and user socket implementations can use to share code.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.h | 25 ++++++++++++++++++--
 hw/vfio/pci.c | 63 ++++++++++++++++++++++++++++++++-------------------
 2 files changed, 63 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 64777516d1..ba2f51d98f 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -114,8 +114,13 @@ typedef struct VFIOMSIXInfo {
     unsigned long *pending;
 } VFIOMSIXInfo;
 
-#define TYPE_VFIO_PCI "vfio-pci"
-OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
+/*
+ * TYPE_VFIO_PCI_BASE is an abstract type used to share code
+ * between VFIO implementations that use a kernel driver
+ * and those that use user sockets.
+ */
+#define TYPE_VFIO_PCI_BASE "vfio-pci-base"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI_BASE)
 
 struct VFIOPCIDevice {
     PCIDevice pdev;
@@ -175,6 +180,22 @@ struct VFIOPCIDevice {
     Notifier irqchip_change_notifier;
 };
 
+#define TYPE_VFIO_PCI "vfio-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOKernPCIDevice, VFIO_PCI)
+
+struct VFIOKernPCIDevice {
+    VFIOPCIDevice device;
+};
+
+#define TYPE_VFIO_USER_PCI "vfio-user-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, VFIO_USER_PCI)
+
+struct VFIOUserPCIDevice {
+    VFIOPCIDevice device;
+    char *sock_name;
+    bool secure;
+};
+
 /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
 static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, uint32_t device)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..bea95efc33 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -231,7 +231,7 @@ static void vfio_intx_update(VFIOPCIDevice *vdev, PCIINTxRoute *route)
 
 static void vfio_intx_routing_notifier(PCIDevice *pdev)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     PCIINTxRoute route;
 
     if (vdev->interrupt != VFIO_INT_INTx) {
@@ -457,7 +457,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIOMSIVector *vector;
     int ret;
 
@@ -542,7 +542,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
 
 static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
 
     trace_vfio_msix_vector_release(vdev->vbasedev.name, nr);
@@ -1063,7 +1063,7 @@ static const MemoryRegionOps vfio_vga_ops = {
  */
 static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIORegion *region = &vdev->bars[bar].region;
     MemoryRegion *mmap_mr, *region_mr, *base_mr;
     PCIIORegion *r;
@@ -1109,7 +1109,7 @@ static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
  */
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
 
     memcpy(&emu_bits, vdev->emulated_config_bits + addr, len);
@@ -1142,7 +1142,7 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 void vfio_pci_write_config(PCIDevice *pdev,
                            uint32_t addr, uint32_t val, int len)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     uint32_t val_le = cpu_to_le32(val);
 
     trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
@@ -2782,7 +2782,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
 
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -3105,7 +3105,7 @@ error:
 
 static void vfio_instance_finalize(Object *obj)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
     VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_display_finalize(vdev);
@@ -3125,7 +3125,7 @@ static void vfio_instance_finalize(Object *obj)
 
 static void vfio_exitfn(PCIDevice *pdev)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
@@ -3144,7 +3144,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
 static void vfio_pci_reset(DeviceState *dev)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(dev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
 
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
@@ -3184,7 +3184,7 @@ post_reset:
 static void vfio_instance_init(Object *obj)
 {
     PCIDevice *pci_dev = PCI_DEVICE(obj);
-    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
 
     device_add_bootindex_property(obj, &vdev->bootindex,
                                   "bootindex", NULL,
@@ -3253,28 +3253,24 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+static void vfio_pci_base_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
 
-    dc->reset = vfio_pci_reset;
-    device_class_set_props(dc, vfio_pci_dev_properties);
-    dc->desc = "VFIO-based PCI device assignment";
+    dc->desc = "VFIO PCI base device";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
-    pdc->realize = vfio_realize;
     pdc->exit = vfio_exitfn;
     pdc->config_read = vfio_pci_read_config;
     pdc->config_write = vfio_pci_write_config;
 }
 
-static const TypeInfo vfio_pci_dev_info = {
-    .name = TYPE_VFIO_PCI,
+static const TypeInfo vfio_pci_base_dev_info = {
+    .name = TYPE_VFIO_PCI_BASE,
     .parent = TYPE_PCI_DEVICE,
-    .instance_size = sizeof(VFIOPCIDevice),
-    .class_init = vfio_pci_dev_class_init,
-    .instance_init = vfio_instance_init,
-    .instance_finalize = vfio_instance_finalize,
+    .instance_size = 0,
+    .abstract = true,
+    .class_init = vfio_pci_base_dev_class_init,
     .interfaces = (InterfaceInfo[]) {
         { INTERFACE_PCIE_DEVICE },
         { INTERFACE_CONVENTIONAL_PCI_DEVICE },
@@ -3282,6 +3278,26 @@ static const TypeInfo vfio_pci_dev_info = {
     },
 };
 
+static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+    dc->reset = vfio_pci_reset;
+    device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->desc = "VFIO-based PCI device assignment";
+    pdc->realize = vfio_realize;
+}
+
+static const TypeInfo vfio_pci_dev_info = {
+    .name = TYPE_VFIO_PCI,
+    .parent = TYPE_VFIO_PCI_BASE,
+    .instance_size = sizeof(VFIOKernPCIDevice),
+    .class_init = vfio_pci_dev_class_init,
+    .instance_init = vfio_instance_init,
+    .instance_finalize = vfio_instance_finalize,
+};
+
 static Property vfio_pci_dev_nohotplug_properties[] = {
     DEFINE_PROP_BOOL("ramfb", VFIOPCIDevice, enable_ramfb, false),
     DEFINE_PROP_END_OF_LIST(),
@@ -3298,12 +3314,13 @@ static void vfio_pci_nohotplug_dev_class_init(ObjectClass *klass, void *data)
 static const TypeInfo vfio_pci_nohotplug_dev_info = {
     .name = TYPE_VFIO_PCI_NOHOTPLUG,
     .parent = TYPE_VFIO_PCI,
-    .instance_size = sizeof(VFIOPCIDevice),
+    .instance_size = sizeof(VFIOKernPCIDevice),
     .class_init = vfio_pci_nohotplug_dev_class_init,
 };
 
 static void register_vfio_pci_dev_type(void)
 {
+    type_register_static(&vfio_pci_base_dev_info);
     type_register_static(&vfio_pci_dev_info);
     type_register_static(&vfio_pci_nohotplug_dev_info);
 }
-- 
2.25.1
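
The patch above places VFIOPCIDevice as the first member of both
VFIOKernPCIDevice and VFIOUserPCIDevice, which is what lets shared code operate
on either through a base pointer. The following plain-C sketch shows that
embedding pattern in isolation. It is an illustration only: base_dev, user_dev,
to_base() and to_user() are hypothetical stand-ins, not the QEMU types, and the
series itself goes through the QOM cast macros such as VFIO_PCI_BASE() rather
than raw casts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared base state (VFIOPCIDevice). */
struct base_dev {
    int interrupt;
};

/* Hypothetical stand-in for a derived type (VFIOUserPCIDevice). */
struct user_dev {
    struct base_dev device;   /* base must be the first member */
    const char *sock_name;
};

/* With the base at offset 0, base and derived pointers share an address. */
_Static_assert(offsetof(struct user_dev, device) == 0,
               "base must be first member");

/* Upcast: always safe, simply takes the address of the embedded base. */
static struct base_dev *to_base(struct user_dev *u)
{
    return &u->device;
}

/*
 * Downcast: valid only when b really is embedded in a user_dev. This sketch
 * does no type check; QOM's cast macros perform that check at run time.
 */
static struct user_dev *to_user(struct base_dev *b)
{
    return (struct user_dev *)b;
}
```

Code written against struct base_dev then works unchanged for every derived
type, which mirrors how the shared functions in hw/vfio/pci.c take the base
type after this patch.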




* [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 01/19] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 02/19] vfio-user: add VFIO base abstract class Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-27 16:34   ` Stefan Hajnoczi
  2021-07-19  6:27 ` [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Add user.c and user.h files for vfio-user with the basic
send and receive functions.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h                | 120 ++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/user.c                | 286 ++++++++++++++++++++++++++++++++++
 MAINTAINERS                   |   4 +
 hw/vfio/meson.build           |   1 +
 5 files changed, 413 insertions(+)
 create mode 100644 hw/vfio/user.h
 create mode 100644 hw/vfio/user.c

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
new file mode 100644
index 0000000000..cdbc074579
--- /dev/null
+++ b/hw/vfio/user.h
@@ -0,0 +1,120 @@
+#ifndef VFIO_USER_H
+#define VFIO_USER_H
+
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Each message has a standard header that describes the command
+ * being sent, which is almost always a VFIO ioctl().
+ *
+ * The header may be followed by command-specific data, such as the
+ * region and offset info for read and write commands.
+ */
+
+/* commands */
+enum vfio_user_command {
+    VFIO_USER_VERSION                   = 1,
+    VFIO_USER_DMA_MAP                   = 2,
+    VFIO_USER_DMA_UNMAP                 = 3,
+    VFIO_USER_DEVICE_GET_INFO           = 4,
+    VFIO_USER_DEVICE_GET_REGION_INFO    = 5,
+    VFIO_USER_DEVICE_GET_REGION_IO_FDS  = 6,
+    VFIO_USER_DEVICE_GET_IRQ_INFO       = 7,
+    VFIO_USER_DEVICE_SET_IRQS           = 8,
+    VFIO_USER_REGION_READ               = 9,
+    VFIO_USER_REGION_WRITE              = 10,
+    VFIO_USER_DMA_READ                  = 11,
+    VFIO_USER_DMA_WRITE                 = 12,
+    VFIO_USER_DEVICE_RESET              = 13,
+    VFIO_USER_DIRTY_PAGES               = 14,
+    VFIO_USER_MAX,
+};
+
+/* flags */
+#define VFIO_USER_REQUEST       0x0
+#define VFIO_USER_REPLY         0x1
+#define VFIO_USER_TYPE          0xF
+
+#define VFIO_USER_NO_REPLY      0x10
+#define VFIO_USER_ERROR         0x20
+
+typedef struct vfio_user_hdr {
+    uint16_t id;
+    uint16_t command;
+    uint32_t size;
+    uint32_t flags;
+    uint32_t error_reply;
+} vfio_user_hdr_t;
+
+/*
+ * VFIO_USER_VERSION
+ */
+#define VFIO_USER_MAJOR_VER     0
+#define VFIO_USER_MINOR_VER     0
+
+struct vfio_user_version {
+    vfio_user_hdr_t hdr;
+    uint16_t major;
+    uint16_t minor;
+    char capabilities[];
+};
+
+#define VFIO_USER_DEF_MAX_FDS   8
+#define VFIO_USER_MAX_MAX_FDS   16
+
+#define VFIO_USER_DEF_MAX_XFER  (1024 * 1024)
+#define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)
+
+typedef struct VFIOUserFDs {
+    int send_fds;
+    int recv_fds;
+    int *fds;
+} VFIOUserFDs;
+
+typedef struct VFIOUserReply {
+    QTAILQ_ENTRY(VFIOUserReply) next;
+    vfio_user_hdr_t *msg;
+    VFIOUserFDs *fds;
+    int rsize;
+    uint32_t id;
+    QemuCond cv;
+    uint8_t complete;
+} VFIOUserReply;
+
+enum proxy_state {
+    CONNECTED = 1,
+    RECV_ERROR = 2,
+    CLOSING = 3,
+    CLOSED = 4,
+};
+
+typedef struct VFIOProxy {
+    QLIST_ENTRY(VFIOProxy) next;
+    char *sockname;
+    struct QIOChannel *ioc;
+    int (*request)(void *opaque, char *buf, VFIOUserFDs *fds);
+    void *reqarg;
+    int flags;
+    QemuCond close_cv;
+
+    /*
+     * above only changed when iolock is held
+     * below are protected by per-proxy lock
+     */
+    QemuMutex lock;
+    QTAILQ_HEAD(, VFIOUserReply) free;
+    QTAILQ_HEAD(, VFIOUserReply) pending;
+    enum proxy_state state;
+    int close_wait;
+} VFIOProxy;
+
+#define VFIO_PROXY_CLIENT       0x1
+
+void vfio_user_recv(void *opaque);
+void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
+#endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0a76..f43dc6e5d0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -75,6 +75,7 @@ typedef struct VFIOAddressSpace {
 } VFIOAddressSpace;
 
 struct VFIOGroup;
+typedef struct VFIOProxy VFIOProxy;
 
 typedef struct VFIOContainer {
     VFIOAddressSpace *space;
@@ -143,6 +144,7 @@ typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    VFIOProxy *proxy;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
new file mode 100644
index 0000000000..021d5540e0
--- /dev/null
+++ b/hw/vfio/user.c
@@ -0,0 +1,286 @@
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "qemu/main-loop.h"
+#include "hw/hw.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "qemu/sockets.h"
+#include "io/channel.h"
+#include "io/channel-util.h"
+#include "sysemu/iothread.h"
+#include "user.h"
+
+static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
+static IOThread *vfio_user_iothread;
+static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
+                                  VFIOUserFDs *fds);
+static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
+                           VFIOUserFDs *fds);
+static void vfio_user_shutdown(VFIOProxy *proxy);
+
+static void vfio_user_shutdown(VFIOProxy *proxy)
+{
+    qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
+    qio_channel_set_aio_fd_handler(proxy->ioc,
+                                   iothread_get_aio_context(vfio_user_iothread),
+                                   NULL, NULL, NULL);
+}
+
+void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret)
+{
+    vfio_user_hdr_t *hdr = (vfio_user_hdr_t *)buf;
+
+    /*
+     * convert header to associated reply
+     * positive ret is reply size, negative is error code
+     */
+    hdr->flags = VFIO_USER_REPLY;
+    if (ret > 0) {
+        hdr->size = ret;
+    } else if (ret < 0) {
+        hdr->flags |= VFIO_USER_ERROR;
+        hdr->error_reply = -ret;
+        hdr->size = sizeof(*hdr);
+    }
+    vfio_user_send(proxy, hdr, NULL);
+}
+
+void vfio_user_recv(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOProxy *proxy = vbasedev->proxy;
+    VFIOUserReply *reply = NULL;
+    g_autofree int *fdp = NULL;
+    VFIOUserFDs reqfds = { 0, 0, fdp };
+    vfio_user_hdr_t msg;
+    struct iovec iov = {
+        .iov_base = &msg,
+        .iov_len = sizeof(msg),
+    };
+    int isreply, i, ret;
+    size_t msgleft, numfds = 0;
+    char *data = NULL;
+    g_autofree char *buf = NULL;
+    Error *local_err = NULL;
+
+    qemu_mutex_lock(&proxy->lock);
+    if (proxy->state == CLOSING) {
+        qemu_mutex_unlock(&proxy->lock);
+        return;
+    }
+
+    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
+                                 &local_err);
+    if (ret <= 0) {
+        /* read error or other side closed connection */
+        error_setg_errno(&local_err, errno, "vfio_user_recv read error");
+        goto fatal;
+    }
+
+    if (ret < sizeof(msg)) {
+        error_setg(&local_err, "vfio_user_recv short read of header");
+        goto err;
+    }
+
+    /*
+     * For replies, find the matching pending request
+     */
+    switch (msg.flags & VFIO_USER_TYPE) {
+    case VFIO_USER_REQUEST:
+        isreply = 0;
+        break;
+    case VFIO_USER_REPLY:
+        isreply = 1;
+        break;
+    default:
+        error_setg(&local_err, "vfio_user_recv unknown message type");
+        goto err;
+    }
+
+    if (isreply) {
+        QTAILQ_FOREACH(reply, &proxy->pending, next) {
+            if (msg.id == reply->id) {
+                break;
+            }
+        }
+        if (reply == NULL) {
+            error_setg(&local_err, "vfio_user_recv unexpected reply");
+            goto err;
+        }
+        QTAILQ_REMOVE(&proxy->pending, reply, next);
+
+        /*
+         * Process any received FDs
+         */
+        if (numfds != 0) {
+            if (reply->fds == NULL || reply->fds->recv_fds < numfds) {
+                error_setg(&local_err, "vfio_user_recv unexpected FDs");
+                goto err;
+            }
+            reply->fds->recv_fds = numfds;
+            memcpy(reply->fds->fds, fdp, numfds * sizeof(int));
+        }
+
+    } else {
+        /*
+         * The client doesn't expect any FDs in requests, but
+         * they will be expected on the server
+         */
+        if (numfds != 0 && (proxy->flags & VFIO_PROXY_CLIENT)) {
+            error_setg(&local_err, "vfio_user_recv fd in client request");
+            goto err;
+        }
+        reqfds.recv_fds = numfds;
+    }
+
+    /*
+     * put the whole message into a single buffer
+     */
+    msgleft = msg.size - sizeof(msg);
+    if (isreply) {
+        if (msg.size > reply->rsize) {
+            error_setg(&local_err,
+                       "vfio_user_recv reply larger than recv buffer");
+            goto fatal;
+        }
+        *reply->msg = msg;
+        data = (char *)reply->msg + sizeof(msg);
+    } else {
+        if (msg.size > max_xfer_size) {
+            error_setg(&local_err, "vfio_user_recv request larger than max");
+            goto fatal;
+        }
+        buf = g_malloc0(msg.size);
+        memcpy(buf, &msg, sizeof(msg));
+        data = buf + sizeof(msg);
+    }
+
+    if (msgleft != 0) {
+        ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
+        if (ret < 0) {
+            goto fatal;
+        }
+        if (ret != msgleft) {
+            error_setg(&local_err, "vfio_user_recv short read of msg body");
+            goto err;
+        }
+    }
+
+    /*
+     * Replies signal a waiter, requests get processed by vfio code
+     * that may assume the iothread lock is held.
+     */
+    qemu_mutex_unlock(&proxy->lock);
+    if (isreply) {
+        reply->complete = 1;
+        qemu_cond_signal(&reply->cv);
+    } else {
+        qemu_mutex_lock_iothread();
+        /*
+         * Make sure the proxy wasn't closed while we waited.
+         * Checking without holding the proxy lock is safe,
+         * since state is only set to CLOSING when iolock is held.
+         */
+        if (proxy->state != CLOSING) {
+            ret = proxy->request(proxy->reqarg, buf, &reqfds);
+            if (ret < 0 && !(msg.flags & VFIO_USER_NO_REPLY)) {
+                vfio_user_send_reply(proxy, buf, ret);
+            }
+        }
+        qemu_mutex_unlock_iothread();
+    }
+
+    return;
+ fatal:
+    vfio_user_shutdown(proxy);
+    proxy->state = RECV_ERROR;
+
+ err:
+    qemu_mutex_unlock(&proxy->lock);
+    for (i = 0; i < numfds; i++) {
+        close(fdp[i]);
+    }
+    if (reply != NULL) {
+        /* force an error to keep sending thread from hanging */
+        reply->msg->flags |= VFIO_USER_ERROR;
+        reply->msg->error_reply = EINVAL;
+        reply->complete = 1;
+        qemu_cond_signal(&reply->cv);
+    }
+    error_report_err(local_err);
+}
+
+static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
+                                  VFIOUserFDs *fds)
+{
+    struct iovec iov = {
+        .iov_base = msg,
+        .iov_len = msg->size,
+    };
+    size_t numfds = 0;
+    int msgleft, ret, *fdp = NULL;
+    char *buf;
+    Error *local_err = NULL;
+
+    if (proxy->state != CONNECTED) {
+        msg->flags |= VFIO_USER_ERROR;
+        msg->error_reply = ECONNRESET;
+        return;
+    }
+
+    if (fds != NULL && fds->send_fds != 0) {
+        numfds = fds->send_fds;
+        fdp = fds->fds;
+    }
+    ret = qio_channel_writev_full(proxy->ioc, &iov, 1, fdp, numfds, &local_err);
+    if (ret < 0) {
+        goto err;
+    }
+    if (ret == msg->size) {
+        return;
+    }
+
+    buf = iov.iov_base + ret;
+    msgleft = iov.iov_len - ret;
+    do {
+        ret = qio_channel_write(proxy->ioc, buf, msgleft, &local_err);
+        if (ret < 0) {
+            goto err;
+        }
+        buf += ret, msgleft -= ret;
+    } while (msgleft != 0);
+    return;
+
+ err:
+    error_report_err(local_err);
+}
+
+static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
+                           VFIOUserFDs *fds)
+{
+    bool iolock = qemu_mutex_iothread_locked();
+
+    if (iolock) {
+        qemu_mutex_unlock_iothread();
+    }
+    qemu_mutex_lock(&proxy->lock);
+    vfio_user_send_locked(proxy, msg, fds);
+    qemu_mutex_unlock(&proxy->lock);
+    if (iolock) {
+        qemu_mutex_lock_iothread();
+    }
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index 12d69f3a45..aa4df6c418 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1883,8 +1883,12 @@ L: qemu-s390x@nongnu.org
 vfio-user
 M: John G Johnson <john.g.johnson@oracle.com>
 M: Thanos Makatos <thanos.makatos@nutanix.com>
+M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
+M: Jagannathan Raman <jag.raman@oracle.com>
 S: Supported
 F: docs/devel/vfio-user.rst
+F: hw/vfio/user.c
+F: hw/vfio/user.h
 
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af297a0..739b30be73 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -8,6 +8,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
   'pci-quirks.c',
   'pci.c',
+  'user.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
 vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))
-- 
2.25.1
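
The receive path in vfio_user_recv() above reads the fixed-size header
first and then reads msg.size - sizeof(msg) further bytes as the body.
A minimal sketch of that length computation follows. The helper name
sketch_body_len and the constant SKETCH_HDR_SIZE are hypothetical, and
the lower-bound check on the size field is an assumption added for
illustration: the patch as posted checks only the upper bound (rsize
for replies, max_xfer_size for requests).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* sizeof(vfio_user_hdr_t): two uint16_t plus three uint32_t fields. */
#define SKETCH_HDR_SIZE 16u

/*
 * Compute how many body bytes follow the header.
 * Returns 0 and stores the body length on success, or -1 when the
 * size field is smaller than the header or larger than max_size.
 */
static int sketch_body_len(uint32_t msg_size, uint64_t max_size,
                           size_t *body_len)
{
    assert(body_len != NULL);
    if (msg_size < SKETCH_HDR_SIZE || msg_size > max_size) {
        return -1;
    }
    *body_len = msg_size - SKETCH_HDR_SIZE;
    return 0;
}
```

Without the lower bound, a size field smaller than the header would
wrap around in the unsigned subtraction and yield a very large body
length.
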




* [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (2 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-28 10:16   ` Stefan Hajnoczi
  2021-07-19  6:27 ` [PATCH RFC 05/19] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Add a new QOM class for vfio-user, along with its class and instance
constructors and destructors.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bea95efc33..554b562769 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,6 +42,7 @@
 #include "qapi/error.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
+#include "hw/vfio/user.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -3326,3 +3327,51 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
+static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
+{
+    ERRP_GUARD();
+    VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
+
+    if (!udev->sock_name) {
+        error_setg(errp, "No socket specified");
+        error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
+        return;
+    }
+}
+
+static void vfio_user_instance_finalize(Object *obj)
+{
+}
+
+static Property vfio_user_pci_dev_properties[] = {
+    DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
+    DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure, false),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+    device_class_set_props(dc, vfio_user_pci_dev_properties);
+    dc->desc = "VFIO over socket PCI device assignment";
+    pdc->realize = vfio_user_pci_realize;
+}
+
+static const TypeInfo vfio_user_pci_dev_info = {
+    .name = TYPE_VFIO_USER_PCI,
+    .parent = TYPE_VFIO_PCI_BASE,
+    .instance_size = sizeof(VFIOUserPCIDevice),
+    .class_init = vfio_user_pci_dev_class_init,
+    .instance_init = vfio_instance_init,
+    .instance_finalize = vfio_user_instance_finalize,
+};
+
+static void register_vfio_user_dev_type(void)
+{
+    type_register_static(&vfio_user_pci_dev_info);
+}
+
+type_init(register_vfio_user_dev_type)
-- 
2.25.1




* [PATCH RFC 05/19] vfio-user: connect vfio proxy to remote server
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (3 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 06/19] vfio-user: negotiate protocol with " Elena Ufimtseva
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h |  2 ++
 hw/vfio/pci.c  | 16 ++++++++++
 hw/vfio/user.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 105 insertions(+)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index cdbc074579..12106ccb6a 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -117,4 +117,6 @@ typedef struct VFIOProxy {
 
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
+VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
+void vfio_user_disconnect(VFIOProxy *proxy);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 554b562769..1effdcd5c0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3332,16 +3332,32 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
 {
     ERRP_GUARD();
     VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIOProxy *proxy;
+    Error *err = NULL;
 
     if (!udev->sock_name) {
         error_setg(errp, "No socket specified");
         error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
         return;
     }
+    proxy = vfio_user_connect_dev(udev->sock_name, &err);
+    if (!proxy) {
+        error_setg(errp, "Remote proxy not found");
+        return;
+    }
+    vbasedev->proxy = proxy;
 }
 
 static void vfio_user_instance_finalize(Object *obj)
 {
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    vfio_put_device(vdev);
+
+    vfio_user_disconnect(vbasedev->proxy);
 }
 
 static Property vfio_user_pci_dev_properties[] = {
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 021d5540e0..371ee9cd8b 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -284,3 +284,90 @@ static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
         qemu_mutex_lock_iothread();
     }
 }
+
+static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
+    QLIST_HEAD_INITIALIZER(vfio_user_sockets);
+
+VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp)
+{
+    VFIOProxy *proxy;
+    struct QIOChannel *ioc;
+    int sockfd;
+
+    sockfd = unix_connect(sockname, errp);
+    if (sockfd == -1) {
+        return NULL;
+    }
+
+    ioc = qio_channel_new_fd(sockfd, errp);
+    if (ioc == NULL) {
+        close(sockfd);
+        return NULL;
+    }
+    qio_channel_set_blocking(ioc, true, NULL);
+
+    proxy = g_malloc0(sizeof(VFIOProxy));
+    proxy->sockname = sockname;
+    proxy->ioc = ioc;
+    proxy->flags = VFIO_PROXY_CLIENT;
+    proxy->state = CONNECTED;
+    qemu_cond_init(&proxy->close_cv);
+
+    if (vfio_user_iothread == NULL) {
+        vfio_user_iothread = iothread_create("VFIO user", errp);
+    }
+
+    qemu_mutex_init(&proxy->lock);
+    QTAILQ_INIT(&proxy->free);
+    QTAILQ_INIT(&proxy->pending);
+    QLIST_INSERT_HEAD(&vfio_user_sockets, proxy, next);
+
+    return proxy;
+}
+
+void vfio_user_disconnect(VFIOProxy *proxy)
+{
+    VFIOUserReply *r1, *r2;
+
+    qemu_mutex_lock(&proxy->lock);
+
+    /* our side is quitting */
+    if (proxy->state == CONNECTED) {
+        vfio_user_shutdown(proxy);
+        if (!QTAILQ_EMPTY(&proxy->pending)) {
+            error_printf("vfio_user_disconnect: outstanding requests\n");
+        }
+    }
+    qio_channel_close(proxy->ioc, NULL);
+    proxy->state = CLOSING;
+
+    QTAILQ_FOREACH_SAFE(r1, &proxy->pending, next, r2) {
+        qemu_cond_destroy(&r1->cv);
+        QTAILQ_REMOVE(&proxy->pending, r1, next);
+        g_free(r1);
+    }
+    QTAILQ_FOREACH_SAFE(r1, &proxy->free, next, r2) {
+        qemu_cond_destroy(&r1->cv);
+        QTAILQ_REMOVE(&proxy->free, r1, next);
+        g_free(r1);
+    }
+
+    /* drop locks so the iothread can make progress */
+    qemu_mutex_unlock_iothread();
+    qemu_cond_wait(&proxy->close_cv, &proxy->lock);
+
+    /* we now hold the only ref to proxy */
+    qemu_mutex_unlock(&proxy->lock);
+    qemu_cond_destroy(&proxy->close_cv);
+    qemu_mutex_destroy(&proxy->lock);
+
+    qemu_mutex_lock_iothread();
+
+    QLIST_REMOVE(proxy, next);
+    if (QLIST_EMPTY(&vfio_user_sockets)) {
+        iothread_destroy(vfio_user_iothread);
+        vfio_user_iothread = NULL;
+    }
+
+    g_free(proxy);
+}
-- 
2.25.1




* [PATCH RFC 06/19] vfio-user: negotiate protocol with remote server
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (4 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 05/19] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 07/19] vfio-user: define vfio-user pci ops Elena Ufimtseva
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send the protocol version and capabilities to the remote server, and
validate its reply.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
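The negotiation in this patch has three parts: a JSON capabilities
string sent after the version header, a version comparison on the
reply, and bounds checks on the capability values. The standalone
sketch below mirrors those rules using the constants from
hw/vfio/user.h. The sketch_* function names are hypothetical, and the
JSON is built with snprintf() here only to keep the example free of
QEMU dependencies; the patch uses QDict and qobject_to_json(), whose
exact spacing and key order may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values copied from hw/vfio/user.h in this series. */
#define VFIO_USER_MAJOR_VER     0
#define VFIO_USER_MINOR_VER     0
#define VFIO_USER_MAX_MAX_FDS   16
#define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)

/* Build the capabilities JSON; returns what snprintf() returns. */
static int sketch_caps_json(char *buf, size_t len, unsigned max_fds,
                            unsigned long long max_xfer)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "{\"capabilities\": {\"max_msg_fds\": %u, "
                    "\"max_data_xfer_size\": %llu}}",
                    max_fds, max_xfer);
}

/* The string is sent with its trailing NUL, as caplen = len + 1. */
static size_t sketch_caplen(const char *json)
{
    return strlen(json) + 1;
}

/* Major must match; the server's minor must not exceed ours. */
static bool sketch_version_ok(uint16_t major, uint16_t minor)
{
    return major == VFIO_USER_MAJOR_VER && minor <= VFIO_USER_MINOR_VER;
}

/* Reject capability values above the compiled-in maximums. */
static bool sketch_caps_ok(uint64_t max_fds, uint64_t max_xfer)
{
    return max_fds <= VFIO_USER_MAX_MAX_FDS &&
           max_xfer <= VFIO_USER_MAX_MAX_XFER;
}
```

In the patch itself, capability names that are not recognized only
produce a "spurious capabilities" warning for now, so they do not fail
the negotiation.
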
 hw/vfio/user.h |   8 ++
 hw/vfio/pci.c  |  10 +++
 hw/vfio/user.c | 223 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 241 insertions(+)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 12106ccb6a..844496ef82 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -64,6 +64,13 @@ struct vfio_user_version {
     char capabilities[];
 };
 
+
+#define VFIO_USER_CAP           "capabilities"
+
+/* "capabilities" members */
+#define VFIO_USER_CAP_MAX_FDS   "max_msg_fds"
+#define VFIO_USER_CAP_MAX_XFER  "max_data_xfer_size"
+
 #define VFIO_USER_DEF_MAX_FDS   8
 #define VFIO_USER_MAX_MAX_FDS   16
 
@@ -119,4 +126,5 @@ void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
 void vfio_user_disconnect(VFIOProxy *proxy);
+int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1effdcd5c0..8ca1431cca 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3348,6 +3348,16 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         return;
     }
     vbasedev->proxy = proxy;
+
+    vfio_user_validate_version(vbasedev, &err);
+    if (err != NULL) {
+        error_propagate(errp, err);
+        goto error;
+    }
+    return;
+
+ error:
+    error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
 }
 
 static void vfio_user_instance_finalize(Object *obj)
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 371ee9cd8b..24dd45b55d 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -22,15 +22,25 @@
 #include "io/channel.h"
 #include "io/channel-util.h"
 #include "sysemu/iothread.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qnull.h"
+#include "qapi/qmp/qstring.h"
+#include "qapi/qmp/qnum.h"
 #include "user.h"
 
 static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
+static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
 static IOThread *vfio_user_iothread;
 static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
                                   VFIOUserFDs *fds);
 static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
                            VFIOUserFDs *fds);
 static void vfio_user_shutdown(VFIOProxy *proxy);
+static void vfio_user_request_msg(vfio_user_hdr_t *hdr, uint16_t cmd,
+                                  uint32_t size, uint32_t flags);
+static void vfio_user_send_recv(VFIOProxy *proxy, vfio_user_hdr_t *msg,
+                                VFIOUserFDs *fds, int rsize);
 
 static void vfio_user_shutdown(VFIOProxy *proxy)
 {
@@ -40,6 +50,72 @@ static void vfio_user_shutdown(VFIOProxy *proxy)
                                    NULL, NULL, NULL);
 }
 
+static void vfio_user_request_msg(vfio_user_hdr_t *hdr, uint16_t cmd,
+                                  uint32_t size, uint32_t flags)
+{
+    static uint16_t next_id;
+
+    hdr->id = qatomic_fetch_inc(&next_id);
+    hdr->command = cmd;
+    hdr->size = size;
+    hdr->flags = (flags & ~VFIO_USER_TYPE) | VFIO_USER_REQUEST;
+    hdr->error_reply = 0;
+}
+
+static int wait_time = 1000;   /* wait 1 sec for replies */
+
+static void vfio_user_send_recv(VFIOProxy *proxy, vfio_user_hdr_t *msg,
+                                VFIOUserFDs *fds, int rsize)
+{
+    VFIOUserReply *reply;
+    bool iolock = qemu_mutex_iothread_locked();
+
+    if (msg->flags & VFIO_USER_NO_REPLY) {
+        error_printf("vfio_user_send_recv on async message\n");
+        return;
+    }
+
+    /*
+     * We will block later, so use a per-proxy lock and let
+     * the iothreads run while we sleep.
+     */
+    if (iolock) {
+        qemu_mutex_unlock_iothread();
+    }
+    qemu_mutex_lock(&proxy->lock);
+
+    reply = QTAILQ_FIRST(&proxy->free);
+    if (reply != NULL) {
+        QTAILQ_REMOVE(&proxy->free, reply, next);
+        reply->complete = 0;
+    } else {
+        reply = g_malloc0(sizeof(*reply));
+        qemu_cond_init(&reply->cv);
+    }
+    reply->msg = msg;
+    reply->fds = fds;
+    reply->id = msg->id;
+    reply->rsize = rsize ? rsize : msg->size;
+    QTAILQ_INSERT_TAIL(&proxy->pending, reply, next);
+
+    vfio_user_send_locked(proxy, msg, fds);
+    if ((msg->flags & VFIO_USER_ERROR) == 0) {
+        while (reply->complete == 0) {
+            if (!qemu_cond_timedwait(&reply->cv, &proxy->lock, wait_time)) {
+                msg->flags |= VFIO_USER_ERROR;
+                msg->error_reply = ETIMEDOUT;
+                break;
+            }
+        }
+    }
+
+    QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
+    qemu_mutex_unlock(&proxy->lock);
+    if (iolock) {
+        qemu_mutex_lock_iothread();
+    }
+}
+
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret)
 {
     vfio_user_hdr_t *hdr = (vfio_user_hdr_t *)buf;
@@ -285,6 +361,153 @@ static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
     }
 }
 
+struct cap_entry {
+    const char *name;
+    int (*check)(QObject *qobj, Error **errp);
+};
+
+static int caps_parse(QDict *qdict, struct cap_entry caps[], Error **errp)
+{
+    QObject *qobj;
+    struct cap_entry *p;
+
+    for (p = caps; p->name != NULL; p++) {
+        qobj = qdict_get(qdict, p->name);
+        if (qobj != NULL) {
+            if (p->check(qobj, errp)) {
+                return -1;
+            }
+            qdict_del(qdict, p->name);
+        }
+    }
+
+    /* warning, for now */
+    if (qdict_size(qdict) != 0) {
+        error_printf("spurious capabilities\n");
+    }
+    return 0;
+}
+
+static int check_max_fds(QObject *qobj, Error **errp)
+{
+    QNum *qn = qobject_to(QNum, qobj);
+
+    if (qn == NULL || !qnum_get_try_uint(qn, &max_send_fds) ||
+        max_send_fds > VFIO_USER_MAX_MAX_FDS) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_FDS);
+        return -1;
+    }
+    return 0;
+}
+
+static int check_max_xfer(QObject *qobj, Error **errp)
+{
+    QNum *qn = qobject_to(QNum, qobj);
+
+    if (qn == NULL || !qnum_get_try_uint(qn, &max_xfer_size) ||
+        max_xfer_size > VFIO_USER_MAX_MAX_XFER) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_XFER);
+        return -1;
+    }
+    return 0;
+}
+
+static struct cap_entry caps_cap[] = {
+    { VFIO_USER_CAP_MAX_FDS, check_max_fds },
+    { VFIO_USER_CAP_MAX_XFER, check_max_xfer },
+    { NULL }
+};
+
+static int check_cap(QObject *qobj, Error **errp)
+{
+    QDict *qdict = qobject_to(QDict, qobj);
+
+    if (qdict == NULL || caps_parse(qdict, caps_cap, errp)) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP);
+        return -1;
+    }
+    return 0;
+}
+
+static struct cap_entry ver_0_0[] = {
+    { VFIO_USER_CAP, check_cap },
+    { NULL }
+};
+
+static int caps_check(int minor, const char *caps, Error **errp)
+{
+    QObject *qobj;
+    QDict *qdict;
+    int ret;
+
+    qobj = qobject_from_json(caps, NULL);
+    if (qobj == NULL) {
+        error_setg(errp, "malformed capabilities %s", caps);
+        return -1;
+    }
+    qdict = qobject_to(QDict, qobj);
+    if (qdict == NULL) {
+        error_setg(errp, "capabilities %s not an object", caps);
+        qobject_unref(qobj);
+        return -1;
+    }
+    ret = caps_parse(qdict, ver_0_0, errp);
+
+    qobject_unref(qobj);
+    return ret;
+}
+
+static GString *caps_json(void)
+{
+    QDict *dict = qdict_new();
+    QDict *capdict = qdict_new();
+    GString *str;
+
+    qdict_put_int(capdict, VFIO_USER_CAP_MAX_FDS, VFIO_USER_MAX_MAX_FDS);
+    qdict_put_int(capdict, VFIO_USER_CAP_MAX_XFER, VFIO_USER_DEF_MAX_XFER);
+
+    qdict_put_obj(dict, VFIO_USER_CAP, QOBJECT(capdict));
+
+    str = qobject_to_json(QOBJECT(dict));
+    qobject_unref(dict);
+    return str;
+}
+
+int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
+{
+    g_autofree struct vfio_user_version *msgp;
+    GString *caps;
+    int size, caplen;
+
+    caps = caps_json();
+    caplen = caps->len + 1;
+    size = sizeof(*msgp) + caplen;
+    msgp = g_malloc0(size);
+
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_VERSION, size, 0);
+    msgp->major = VFIO_USER_MAJOR_VER;
+    msgp->minor = VFIO_USER_MINOR_VER;
+    memcpy(&msgp->capabilities, caps->str, caplen);
+    g_string_free(caps, true);
+
+    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0);
+    if (msgp->hdr.flags & VFIO_USER_ERROR) {
+        error_setg_errno(errp, msgp->hdr.error_reply, "version reply");
+        return -1;
+    }
+
+    if (msgp->major != VFIO_USER_MAJOR_VER ||
+        msgp->minor > VFIO_USER_MINOR_VER) {
+        error_setg(errp, "incompatible server version");
+        return -1;
+    }
+    if (caps_check(msgp->minor, (char *)msgp + sizeof(*msgp), errp) != 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
 static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
     QLIST_HEAD_INITIALIZER(vfio_user_sockets);
 
-- 
2.25.1
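
vfio_user_send_recv() above queues a VFIOUserReply on proxy->pending
before sending, and vfio_user_recv() later finds the entry whose id
matches the reply header and removes it from the list. A minimal
sketch of that matching step follows. It uses a plain singly linked
list instead of QTAILQ, inserts at the head rather than the tail, and
leaves out the locking and condition variable; all sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for VFIOUserReply: just the fields needed for matching. */
struct sketch_reply {
    struct sketch_reply *next;
    uint16_t id;
    int complete;
};

/* Stand-in for the proxy's pending list. */
struct sketch_pending {
    struct sketch_reply *head;
};

/* Record an outstanding request before it is sent. */
static void sketch_enqueue(struct sketch_pending *q, struct sketch_reply *r)
{
    assert(q != NULL && r != NULL);
    r->complete = 0;
    r->next = q->head;
    q->head = r;
}

/*
 * Find the pending entry for a reply id and unlink it.
 * Returns NULL for an unexpected reply (no matching request).
 */
static struct sketch_reply *sketch_take(struct sketch_pending *q, uint16_t id)
{
    struct sketch_reply **pp;

    assert(q != NULL);
    for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            struct sketch_reply *r = *pp;

            *pp = r->next;
            r->next = NULL;
            return r;
        }
    }
    return NULL;
}
```

Because each entry is unlinked when its reply arrives, a second reply
carrying the same id finds no match and is treated as unexpected.
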




* [PATCH RFC 07/19] vfio-user: define vfio-user pci ops
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (5 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 06/19] vfio-user: negotiate protocol with " Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 08/19] vfio-user: VFIO container setup & teardown Elena Ufimtseva
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8ca1431cca..388b7d82d7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3328,6 +3328,29 @@ static void register_vfio_pci_dev_type(void)
 
 type_init(register_vfio_pci_dev_type)
 
+/*
+ * Emulated devices don't use host hot reset
+ */
+static int vfio_user_pci_no_reset(VFIODevice *vbasedev)
+{
+    error_printf("vfio-user - no hot reset\n");
+    return 0;
+}
+
+static void vfio_user_pci_not_needed(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_user_pci_ops = {
+    .vfio_compute_needs_reset = vfio_user_pci_not_needed,
+    .vfio_hot_reset_multi = vfio_user_pci_no_reset,
+    .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
+};
+
 static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
 {
     ERRP_GUARD();
@@ -3354,6 +3377,14 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         error_propagate(errp, err);
         goto error;
     }
+
+    vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
+    vbasedev->dev = DEVICE(vdev);
+    vbasedev->fd = -1;
+    vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+    vbasedev->no_mmap = false;
+    vbasedev->ops = &vfio_user_pci_ops;
+
     return;
 
  error:
-- 
2.25.1




* [PATCH RFC 08/19] vfio-user: VFIO container setup & teardown
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (6 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 07/19] vfio-user: define vfio-user pci ops Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 09/19] vfio-user: get device info and get irq info Elena Ufimtseva
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Create software-emulated containers and groups for vfio-user
in place of the host-IOMMU-based ones used by the kernel
VFIO driver.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
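Reader's note (not part of the commit): a minimal sketch of the
one-device, one-group, one-container pairing described above, with
teardown done in reverse order. The struct and function names are
hypothetical simplifications; the real code also registers a memory
listener and links into QEMU's address-space lists, which this sketch
leaves out.

```c
#include <stdint.h>
#include <stdlib.h>

struct group;

/* Hypothetical stand-in for VFIOContainer: no kernel fd behind it. */
struct container {
    int fd;                 /* always -1: nothing was opened on the host */
    uint64_t pgsizes;       /* supported page sizes (bitmask) */
    void *proxy;            /* opaque handle for the socket connection */
    struct group *group;    /* the single group attached to this container */
};

/* Hypothetical stand-in for VFIOGroup. */
struct group {
    int fd;
    int groupid;
    struct container *container;
};

/* Build a container for one group; returns 0, or -1 if allocation fails. */
int connect_proxy(struct group *g, void *proxy)
{
    struct container *c = calloc(1, sizeof(*c));

    if (c == NULL) {
        return -1;
    }
    c->fd = -1;
    c->pgsizes = 4096;
    c->proxy = proxy;
    c->group = g;
    g->container = c;
    return 0;
}

/* Undo connect_proxy(): detach the group first, then free the container. */
void disconnect_proxy(struct group *g)
{
    struct container *c = g->container;

    g->container = NULL;
    free(c);
}
```
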
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 70 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 19 ++++++++++
 3 files changed, 92 insertions(+)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f43dc6e5d0..491a92b4f5 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -90,6 +90,7 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    VFIOProxy *proxy;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
@@ -214,6 +215,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as);
+void vfio_disconnect_proxy(VFIOGroup *group);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..45acdeeb46 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -2206,6 +2206,41 @@ put_space_exit:
     return ret;
 }
 
+void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+
+    /*
+     * try to mirror vfio_connect_container()
+     * as much as possible
+     */
+
+    space = vfio_get_address_space(as);
+
+    container = g_malloc0(sizeof(*container));
+    container->space = space;
+    container->fd = -1;
+    QLIST_INIT(&container->hostwin_list);
+    container->proxy = proxy;
+
+    container->iommu_type = VFIO_TYPE1_IOMMU;
+    vfio_host_win_add(container, 0, (hwaddr)-1, 4096);
+    container->pgsizes = 4096;
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    QLIST_INIT(&container->giommu_list);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    container->listener = vfio_memory_listener;
+    memory_listener_register(&container->listener, container->space->as);
+    container->initialized = true;
+}
+
 static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
@@ -2248,6 +2283,41 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
+void vfio_disconnect_proxy(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+    VFIOAddressSpace *space = container->space;
+    VFIOGuestIOMMU *giommu, *tmp;
+
+    /*
+     * try to mirror vfio_disconnect_container()
+     * as much as possible, knowing each device
+     * is in one group and one container
+     */
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    /*
+     * Explicitly release the listener before tearing down the
+     * container, since the listener is embedded in the container
+     * that is freed below.
+     */
+    memory_listener_unregister(&container->listener);
+
+    QLIST_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+            MEMORY_REGION(giommu->iommu), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+
+    g_free(container);
+    vfio_put_address_space(space);
+}
+
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 388b7d82d7..5ed42ad858 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3358,6 +3358,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
     VFIOProxy *proxy;
+    VFIOGroup *group = NULL;
     Error *err = NULL;
 
     if (!udev->sock_name) {
@@ -3385,6 +3386,19 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     vbasedev->no_mmap = false;
     vbasedev->ops = &vfio_user_pci_ops;
 
+    /*
+     * Each device gets its own group and container,
+     * unrelated to any host IOMMU groupings.
+     */
+    group = g_malloc0(sizeof(*group));
+    group->fd = -1;
+    group->groupid = -1;
+    QLIST_INIT(&group->device_list);
+    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
+    vbasedev->group = group;
+
+    vfio_connect_proxy(proxy, group, pci_device_iommu_address_space(pdev));
+
     return;
 
  error:
@@ -3395,6 +3409,11 @@ static void vfio_user_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
     VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIOGroup *group = vbasedev->group;
+
+    vfio_disconnect_proxy(group);
+    g_free(group);
+    vbasedev->group = NULL;
 
     vfio_put_device(vdev);
 
-- 
2.25.1




* [PATCH RFC 09/19] vfio-user: get device info and get irq info
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (7 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 08/19] vfio-user: VFIO container setup & teardown Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 10/19] vfio-user: device region read/write Elena Ufimtseva
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send VFIO_USER_DEVICE_GET_INFO and
VFIO_USER_DEVICE_GET_IRQ_INFO commands.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
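Reader's note (not part of the commit): the reply is copied out
starting at the argsz field, which only works if the message is a
header followed directly by the same fields as struct vfio_irq_info.
The sketch below spells that layout assumption out with compile-time
checks. The header layout is an assumption (only flags and error_reply
appear in this patch), and irq_info is a local copy of the kernel
struct's field order, not the real type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte message header; id/command/size are assumed. */
struct hdr {
    uint16_t id;
    uint16_t command;
    uint32_t size;
    uint32_t flags;
    uint32_t error_reply;
};

/* Local copy of the field order of the kernel's struct vfio_irq_info. */
struct irq_info {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t count;
};

/* Wire message: header immediately followed by the irq_info fields. */
struct irq_info_msg {
    struct hdr hdr;
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t count;
};

/* The payload must start right after the header, with no padding. */
static_assert(offsetof(struct irq_info_msg, argsz) == sizeof(struct hdr),
              "payload must follow the header directly");
static_assert(sizeof(struct irq_info_msg) ==
              sizeof(struct hdr) + sizeof(struct irq_info),
              "payload must match struct irq_info exactly");

/* Copy the payload that follows the header into the caller's struct. */
void unpack_irq_info(const struct irq_info_msg *msg, struct irq_info *info)
{
    const char *payload = (const char *)msg +
                          offsetof(struct irq_info_msg, argsz);

    memcpy(info, payload, sizeof(*info));
}
```

Going through offsetof() rather than taking the address of a single
member keeps the copy well-defined for compilers that track member
bounds.
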
 hw/vfio/user.h | 27 +++++++++++++++++++++++++++
 hw/vfio/pci.c  | 32 +++++++++++++++++++++++++++++---
 hw/vfio/user.c | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 96 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 844496ef82..9f51e14c7c 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -122,9 +122,36 @@ typedef struct VFIOProxy {
 
 #define VFIO_PROXY_CLIENT       0x1
 
+/*
+ * VFIO_USER_DEVICE_GET_INFO
+ * imported from struct vfio_device_info
+ */
+struct vfio_user_device_info {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t num_regions;
+    uint32_t num_irqs;
+    uint32_t cap_offset;
+};
+
+/*
+ * VFIO_USER_DEVICE_GET_IRQ_INFO
+ * imported from struct vfio_irq_info
+ */
+struct vfio_user_irq_info {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t index;
+    uint32_t count;
+};
+
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
 void vfio_user_disconnect(VFIOProxy *proxy);
 int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
+int vfio_user_get_info(VFIODevice *vbasedev);
+int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5ed42ad858..029a191bcb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2620,7 +2620,12 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 
     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
 
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_get_irq_info(vbasedev, &irq_info);
+    } else {
+        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    }
+
     if (ret) {
         /* This can fail for an old kernel or legacy PCI dev */
         trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
@@ -2739,8 +2744,16 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (ioctl(vdev->vbasedev.fd,
-              VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 || irq_info.count < 1) {
+    if (vdev->vbasedev.proxy != NULL) {
+        if (vfio_user_get_irq_info(&vdev->vbasedev, &irq_info) < 0) {
+            return;
+        }
+    } else {
+        if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0) {
+            return;
+        }
+    }
+    if (irq_info.count < 1) {
         return;
     }
 
@@ -3359,6 +3372,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     VFIODevice *vbasedev = &vdev->vbasedev;
     VFIOProxy *proxy;
     VFIOGroup *group = NULL;
+    int ret;
     Error *err = NULL;
 
     if (!udev->sock_name) {
@@ -3399,6 +3413,18 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
 
     vfio_connect_proxy(proxy, group, pci_device_iommu_address_space(pdev));
 
+    ret = vfio_user_get_info(&vdev->vbasedev);
+    if (ret) {
+        error_setg_errno(errp, -ret, "get info failure");
+        goto error;
+    }
+
+    vfio_populate_device(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        goto error;
+    }
+
     return;
 
  error:
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 24dd45b55d..a282b7b7b8 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -508,6 +508,27 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
     return 0;
 }
 
+int vfio_user_get_info(VFIODevice *vbasedev)
+{
+    struct vfio_user_device_info msg;
+
+    memset(&msg, 0, sizeof(msg));
+    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_INFO, sizeof(msg), 0);
+    msg.argsz = sizeof(struct vfio_device_info);
+
+    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0);
+    if (msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msg.hdr.error_reply;
+    }
+
+    vbasedev->num_irqs = msg.num_irqs;
+    vbasedev->num_regions = msg.num_regions;
+    vbasedev->flags = msg.flags;
+    vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
+    return 0;
+
+}
+
 static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
     QLIST_HEAD_INITIALIZER(vfio_user_sockets);
 
@@ -594,3 +615,22 @@ void vfio_user_disconnect(VFIOProxy *proxy)
 
     g_free(proxy);
 }
+
+int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info)
+{
+    struct vfio_user_irq_info msg;
+
+    memset(&msg, 0, sizeof(msg));
+    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_IRQ_INFO,
+                          sizeof(msg), 0);
+    msg.argsz = info->argsz;
+    msg.index = info->index;
+
+    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0);
+    if (msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msg.hdr.error_reply;
+    }
+
+    memcpy(info, &msg.argsz, sizeof(*info));
+    return 0;
+}
-- 
2.25.1




* [PATCH RFC 10/19] vfio-user: device region read/write
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (8 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 09/19] vfio-user: get device info and get irq info Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 11/19] vfio-user: get region and DMA map/unmap operations Elena Ufimtseva
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send VFIO_USER_REGION_READ and VFIO_USER_REGION_WRITE commands.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
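Reader's note (not part of the commit): a minimal sketch of the
variable-length region access message, where the data follows a fixed
part via a flexible array member. The header layout and the helper
names build_region_write and region_read_len are assumptions made for
illustration; sending the message over the socket is left out.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 16-byte message header; field names are assumed. */
struct hdr {
    uint16_t id;
    uint16_t command;
    uint32_t size;
    uint32_t flags;
    uint32_t error_reply;
};

/* Fixed part of a region read/write message, followed by the data. */
struct region_rw {
    struct hdr hdr;
    uint64_t offset;
    uint32_t region;
    uint32_t count;
    char data[];
};

/* Allocate and fill a write request; the caller frees it. NULL on OOM. */
struct region_rw *build_region_write(uint32_t region, uint64_t offset,
                                     const void *data, uint32_t count,
                                     size_t *total)
{
    size_t size = sizeof(struct region_rw) + count;
    struct region_rw *msg = calloc(1, size);

    if (msg == NULL) {
        return NULL;
    }
    msg->hdr.size = (uint32_t)size;   /* header plus payload */
    msg->offset = offset;
    msg->region = region;
    msg->count = count;
    memcpy(msg->data, data, count);
    *total = size;
    return msg;
}

/* Validate a read reply: the server must not return more than requested. */
int region_read_len(const struct region_rw *reply, uint32_t requested)
{
    if (reply->count > requested) {
        return -E2BIG;
    }
    return (int)reply->count;
}
```
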
 hw/vfio/user.h   | 16 ++++++++++++++++
 hw/vfio/common.c | 17 +++++++++++++++--
 hw/vfio/pci.c    | 13 +++++++++++++
 hw/vfio/user.c   | 45 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 9f51e14c7c..17c4d90ef1 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -147,6 +147,18 @@ struct vfio_user_irq_info {
     uint32_t count;
 };
 
+/*
+ * VFIO_USER_REGION_READ
+ * VFIO_USER_REGION_WRITE
+ */
+struct vfio_user_region_rw {
+    vfio_user_hdr_t hdr;
+    uint64_t offset;
+    uint32_t region;
+    uint32_t count;
+    char data[];
+};
+
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
@@ -154,4 +166,8 @@ void vfio_user_disconnect(VFIOProxy *proxy);
 int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
 int vfio_user_get_info(VFIODevice *vbasedev);
 int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info);
+int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
+                          uint32_t count, void *data);
+int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
+                           uint64_t offset, uint32_t count, void *data);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 45acdeeb46..74041cc438 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "hw/vfio/user.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -214,6 +215,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
         uint32_t dword;
         uint64_t qword;
     } buf;
+    int ret;
 
     switch (size) {
     case 1:
@@ -233,7 +235,12 @@ void vfio_region_write(void *opaque, hwaddr addr,
         break;
     }
 
-    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_region_write(vbasedev, region->nr, addr, size, &buf);
+    } else {
+        ret = pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr);
+    }
+    if (ret != size) {
         error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
                      ",%d) failed: %m",
                      __func__, vbasedev->name, region->nr,
@@ -265,8 +272,14 @@ uint64_t vfio_region_read(void *opaque,
         uint64_t qword;
     } buf;
     uint64_t data = 0;
+    int ret;
 
-    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_region_read(vbasedev, region->nr, addr, size, &buf);
+    } else {
+        ret = pread(vbasedev->fd, &buf, size, region->fd_offset + addr);
+    }
+    if (ret != size) {
         error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
                      __func__, vbasedev->name, region->nr,
                      addr, size);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 029a191bcb..1054978e5e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3424,6 +3424,19 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         error_propagate(errp, err);
         goto error;
     }
+    /* Get a copy of config space */
+    ret = vfio_user_region_read(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX, 0,
+                                MIN(pci_config_size(pdev), vdev->config_size),
+                                pdev->config);
+    if (ret < 0) {
+        goto error;
+    }
+
+    /* vfio emulates a lot for us, but some bits need extra love */
+    vdev->emulated_config_bits = g_malloc0(vdev->config_size);
+
+    /* QEMU can also add or extend BARs */
+    memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
 
     return;
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index a282b7b7b8..2bb6f8650e 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -634,3 +634,48 @@ int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info)
     memcpy(info, &msg.argsz, sizeof(*info));
     return 0;
 }
+
+int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
+                                 uint32_t count, void *data)
+{
+    g_autofree struct vfio_user_region_rw *msgp = NULL;
+    int size = sizeof(*msgp) + count;
+
+    /* the reply carries the data, so size the buffer for header + count */
+    msgp = g_malloc0(size);
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_READ, sizeof(*msgp), 0);
+    msgp->offset = offset;
+    msgp->region = index;
+    msgp->count = count;
+
+    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, size);
+    if (msgp->hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->hdr.error_reply;
+    } else if (msgp->count > count) {
+        return -E2BIG;
+    } else {
+        memcpy(data, &msgp->data, msgp->count);
+    }
+
+    return msgp->count;
+}
+
+int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
+                           uint64_t offset, uint32_t count, void *data)
+{
+    g_autofree struct vfio_user_region_rw *msgp = NULL;
+    int size = sizeof(*msgp) + count;
+
+    /* the request carries the data, so allocate header + count bytes */
+    msgp = g_malloc0(size);
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
+                          VFIO_USER_NO_REPLY);
+    msgp->offset = offset;
+    msgp->region = index;
+    msgp->count = count;
+    memcpy(&msgp->data, data, count);
+
+    vfio_user_send(vbasedev->proxy, &msgp->hdr, NULL);
+
+    return count;
+}
-- 
2.25.1




* [PATCH RFC 11/19] vfio-user: get region and DMA map/unmap operations
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (9 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 10/19] vfio-user: device region read/write Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 12/19] vfio-user: probe remote device's BARs Elena Ufimtseva
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send VFIO_USER_DEVICE_GET_REGION_INFO to query device
regions, and VFIO_USER_DMA_MAP/UNMAP to tell the remote
server which DMA addresses it can access.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
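Reader's note (not part of the commit): the DMA map path sends the
memory region's file descriptor only when one exists and the proxy is
not marked secure; otherwise it sends no descriptor and a zero offset.
The sketch below isolates that decision as a pure function. The names
dma_map_plan and plan_dma_map are hypothetical, and fd_offset stands
for the offset of the mapping inside the backing file.

```c
#include <stdbool.h>
#include <stdint.h>

/* Outcome of the decision: what goes into the DMA map request. */
struct dma_map_plan {
    bool send_fd;       /* pass the backing fd along with the request */
    uint64_t offset;    /* offset into that fd, or 0 when no fd is sent */
};

/*
 * Decide how to describe a mapping to the remote process.
 * fd is -1 when the memory is not backed by a file.
 */
struct dma_map_plan plan_dma_map(int fd, bool secure, uint64_t fd_offset)
{
    struct dma_map_plan plan = { .send_fd = false, .offset = 0 };

    if (fd != -1 && !secure) {
        plan.send_fd = true;
        plan.offset = fd_offset;
    }
    return plan;
}
```

Without a descriptor the remote process cannot map guest memory
directly, so it has to reach that range through messages instead.
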
 hw/vfio/user.h                |  54 ++++++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/common.c              |  84 +++++++++++++++++++++++++---
 hw/vfio/pci.c                 |   4 ++
 hw/vfio/user.c                | 100 ++++++++++++++++++++++++++++++++++
 5 files changed, 236 insertions(+), 8 deletions(-)
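
Reader's note (not part of the commit): the region info lookup adds a
per-device cache in front of the existing argsz retry loop. The sketch
below shows that combination with a fake backend standing in for the
ioctl or socket request. All names here (region_info, query_fn,
get_region_info, fake_query) are hypothetical, and file descriptor
handling is left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local copy of the field order of the kernel's struct vfio_region_info. */
struct region_info {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t cap_offset;
    uint64_t size;
    uint64_t offset;
};

/* Backend query: fills *info and stores the size it needs in argsz. */
typedef int (*query_fn)(struct region_info *info);

/* Return a caller-owned copy; the backend is asked only on a cache miss. */
int get_region_info(struct region_info **cache, uint32_t index,
                    query_fn query, struct region_info **out)
{
    size_t argsz = sizeof(struct region_info);
    struct region_info *info, *tmp;
    int ret;

    if (cache[index] != NULL) {             /* cache hit: hand out a copy */
        argsz = cache[index]->argsz;
        info = malloc(argsz);
        if (info == NULL) {
            return -1;
        }
        memcpy(info, cache[index], argsz);
        *out = info;
        return 0;
    }

    info = calloc(1, argsz);
    if (info == NULL) {
        return -1;
    }
    info->index = index;

    for (;;) {
        info->argsz = (uint32_t)argsz;
        ret = query(info);
        if (ret != 0 || info->argsz < sizeof(*info)) {
            free(info);
            return ret != 0 ? ret : -1;
        }
        if (info->argsz <= argsz) {
            break;                          /* the reply fit */
        }
        argsz = info->argsz;                /* grow and ask again */
        tmp = realloc(info, argsz);
        if (tmp == NULL) {
            free(info);
            return -1;
        }
        info = tmp;
    }

    cache[index] = malloc(argsz);           /* fill the cache */
    if (cache[index] != NULL) {
        memcpy(cache[index], info, argsz);
    }
    *out = info;
    return 0;
}

/* Fake backend for illustration: needs 8 bytes beyond the fixed struct. */
int fake_query_calls;

int fake_query(struct region_info *info)
{
    uint32_t need = (uint32_t)sizeof(*info) + 8;

    fake_query_calls++;
    if (info->argsz >= need) {
        info->size = 0x1000;
        memset(info + 1, 0, 8);             /* pretend capability bytes */
    }
    info->argsz = need;
    return 0;
}
```

The first lookup of a region costs two backend calls here (one to learn
the size, one to fetch the data); later lookups of the same region are
served from the cache.
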

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 17c4d90ef1..351fdb3ee1 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -121,6 +121,7 @@ typedef struct VFIOProxy {
 } VFIOProxy;
 
 #define VFIO_PROXY_CLIENT       0x1
+#define VFIO_PROXY_SECURE       0x2
 
 /*
  * VFIO_USER_DEVICE_GET_INFO
@@ -159,6 +160,52 @@ struct vfio_user_region_rw {
     char data[];
 };
 
+/*
+ * VFIO_USER_DMA_MAP
+ * imported from struct vfio_iommu_type1_dma_map
+ */
+struct vfio_user_dma_map {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint64_t offset;    /* FD offset */
+    uint64_t iova;
+    uint64_t size;
+};
+
+/* imported from struct vfio_bitmap */
+struct vfio_user_bitmap {
+    uint64_t pgsize;
+    uint64_t size;
+    char data[];
+};
+
+/*
+ * VFIO_USER_DMA_UNMAP
+ * imported from struct vfio_iommu_type1_dma_unmap
+ */
+struct vfio_user_dma_unmap {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint64_t iova;
+    uint64_t size;
+};
+
+/*
+ * VFIO_USER_DEVICE_GET_REGION_INFO
+ * imported from struct vfio_region_info
+ */
+struct vfio_user_region_info {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t index;
+    uint32_t cap_offset;
+    uint64_t size;
+    uint64_t offset;
+};
+
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
@@ -170,4 +217,11 @@ int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
                           uint32_t count, void *data);
 int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
                            uint64_t offset, uint32_t count, void *data);
+int vfio_user_dma_map(VFIOProxy *proxy, struct vfio_iommu_type1_dma_map *map,
+                      VFIOUserFDs *fds);
+int vfio_user_dma_unmap(VFIOProxy *proxy,
+                        struct vfio_iommu_type1_dma_unmap *unmap,
+                        struct vfio_bitmap *bitmap);
+int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
+                              struct vfio_region_info *info, VFIOUserFDs *fds);
 #endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 491a92b4f5..d7b717594b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -146,6 +146,8 @@ typedef struct VFIODevice {
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
     VFIOProxy *proxy;
+    struct vfio_region_info **regions;
+    int *regfds;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 74041cc438..52a092e168 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -477,6 +477,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
+    if (container->proxy != NULL) {
+        return vfio_user_dma_unmap(container->proxy, &unmap, NULL);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -503,7 +507,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
     return 0;
 }
 
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+static int vfio_dma_map(VFIOContainer *container, MemoryRegion *mr, hwaddr iova,
                         ram_addr_t size, void *vaddr, bool readonly)
 {
     struct vfio_iommu_type1_dma_map map = {
@@ -518,6 +522,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
 
+    if (container->proxy != NULL) {
+        VFIOUserFDs fds;
+        int fd;
+
+        fd = memory_region_get_fd(mr);
+        if (fd != -1 && !(container->proxy->flags & VFIO_PROXY_SECURE)) {
+            fds.send_fds = 1;
+            fds.recv_fds = 0;
+            fds.fds = &fd;
+            map.vaddr = qemu_ram_block_host_offset(mr->ram_block, vaddr);
+
+            return vfio_user_dma_map(container->proxy, &map, &fds);
+        } else {
+            map.vaddr = 0;
+            return vfio_user_dma_map(container->proxy, &map, NULL);
+        }
+    }
+
     /*
      * Try the mapping, if it fails with EBUSY, unmap the region and try
      * again.  This shouldn't be necessary, but we sometimes see it in
@@ -586,7 +608,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 
 /* Called with rcu_read_lock held.  */
 static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                               ram_addr_t *ram_addr, bool *read_only)
+                               ram_addr_t *ram_addr, bool *read_only,
+                               MemoryRegion **mrp)
 {
     MemoryRegion *mr;
     hwaddr xlat;
@@ -667,6 +690,10 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
         *read_only = !writable || mr->readonly;
     }
 
+    if (mrp != NULL) {
+        *mrp = mr;
+    }
+
     return true;
 }
 
@@ -674,6 +701,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
+    MemoryRegion *mr;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
     void *vaddr;
     int ret;
@@ -692,7 +720,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         bool read_only;
 
-        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &mr)) {
             goto out;
         }
         /*
@@ -702,7 +730,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          * of vaddr will always be there, even if the memory object is
          * destroyed and its backing memory munmap-ed.
          */
-        ret = vfio_dma_map(container, iova,
+        ret = vfio_dma_map(container, mr, iova,
                            iotlb->addr_mask + 1, vaddr,
                            read_only);
         if (ret) {
@@ -764,7 +792,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
                section->offset_within_address_space;
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
-        ret = vfio_dma_map(vrdl->container, iova, next - start,
+        ret = vfio_dma_map(vrdl->container, section->mr, iova, next - start,
                            vaddr, section->readonly);
         if (ret) {
             /* Rollback */
@@ -1064,7 +1092,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
+    ret = vfio_dma_map(container, section->mr, iova, int128_get64(llsize),
                        vaddr, section->readonly);
     if (ret) {
         error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
@@ -1330,7 +1358,7 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 
     rcu_read_lock();
-    if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+    if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, NULL)) {
         int ret;
 
         ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
@@ -2493,6 +2521,24 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info)
 {
     size_t argsz = sizeof(struct vfio_region_info);
+    int fd = -1;
+    int ret;
+
+    /* create region cache */
+    if (vbasedev->regions == NULL) {
+        vbasedev->regions = g_new0(struct vfio_region_info *,
+                                   vbasedev->num_regions);
+        if (vbasedev->proxy != NULL) {
+            vbasedev->regfds = g_new0(int, vbasedev->num_regions);
+        }
+    }
+    /* check cache */
+    if (vbasedev->regions[index] != NULL) {
+        *info = g_malloc0(vbasedev->regions[index]->argsz);
+        memcpy(*info, vbasedev->regions[index],
+               vbasedev->regions[index]->argsz);
+        return 0;
+    }
 
     *info = g_malloc0(argsz);
 
@@ -2500,7 +2546,17 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 retry:
     (*info)->argsz = argsz;
 
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
+    if (vbasedev->proxy != NULL) {
+        VFIOUserFDs fds = { 0, 1, &fd };
+
+        ret = vfio_user_get_region_info(vbasedev, index, *info, &fds);
+    } else {
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info);
+        if (ret < 0) {
+            ret = -errno;
+        }
+    }
+    if (ret != 0) {
         g_free(*info);
         *info = NULL;
         return -errno;
@@ -2509,10 +2565,22 @@ retry:
     if ((*info)->argsz > argsz) {
         argsz = (*info)->argsz;
         *info = g_realloc(*info, argsz);
+        if (fd != -1) {
+            close(fd);
+            fd = -1;
+        }
 
         goto retry;
     }
 
+    /* fill cache */
+    vbasedev->regions[index] = g_malloc0(argsz);
+    memcpy(vbasedev->regions[index], *info, argsz);
+    *vbasedev->regions[index] = **info;
+    if (vbasedev->regfds != NULL) {
+        vbasedev->regfds[index] = fd;
+    }
+
     return 0;
 }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1054978e5e..054e673552 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3387,6 +3387,10 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     }
     vbasedev->proxy = proxy;
 
+    if (udev->secure) {
+        proxy->flags |= VFIO_PROXY_SECURE;
+    }
+
     vfio_user_validate_version(vbasedev, &err);
     if (err != NULL) {
         error_propagate(errp, err);
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 2bb6f8650e..eea8b9b402 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -679,3 +679,103 @@ int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
 
     return count;
 }
+
+int vfio_user_dma_map(VFIOProxy *proxy, struct vfio_iommu_type1_dma_map *map,
+                      VFIOUserFDs *fds)
+{
+    struct vfio_user_dma_map msg;
+    int ret;
+
+    vfio_user_request_msg(&msg.hdr, VFIO_USER_DMA_MAP, sizeof(msg), 0);
+    msg.argsz = map->argsz;
+    msg.flags = map->flags;
+    msg.offset = map->vaddr;
+    msg.iova = map->iova;
+    msg.size = map->size;
+
+    vfio_user_send_recv(proxy, &msg.hdr, fds, 0);
+    ret = (msg.hdr.flags & VFIO_USER_ERROR) ? -msg.hdr.error_reply : 0;
+    return ret;
+}
+
+int vfio_user_dma_unmap(VFIOProxy *proxy,
+                        struct vfio_iommu_type1_dma_unmap *unmap,
+                        struct vfio_bitmap *bitmap)
+{
+    g_autofree struct {
+        struct vfio_user_dma_unmap msg;
+        struct vfio_user_bitmap bitmap;
+    } *msgp = NULL;
+    int msize, rsize;
+
+    if (bitmap == NULL && (unmap->flags &
+                           VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)) {
+        error_printf("vfio_user_dma_unmap mismatched flags and bitmap\n");
+        return -EINVAL;
+    }
+
+    /*
+     * If a dirty bitmap is returned, allocate extra space for it
+     * otherwise, just send the unmap request
+     */
+    if (bitmap != NULL) {
+        msize = sizeof(*msgp);
+        rsize = msize + bitmap->size;
+        msgp = g_malloc0(rsize);
+        msgp->bitmap.pgsize = bitmap->pgsize;
+        msgp->bitmap.size = bitmap->size;
+    } else {
+        msize = rsize = sizeof(struct vfio_user_dma_unmap);
+        msgp = g_malloc0(rsize);
+    }
+
+    vfio_user_request_msg(&msgp->msg.hdr, VFIO_USER_DMA_UNMAP, msize, 0);
+    msgp->msg.argsz = unmap->argsz;
+    msgp->msg.flags = unmap->flags;
+    msgp->msg.iova = unmap->iova;
+    msgp->msg.size = unmap->size;
+
+    vfio_user_send_recv(proxy, &msgp->msg.hdr, NULL, rsize);
+    if (msgp->msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->msg.hdr.error_reply;
+    }
+
+    if (bitmap != NULL) {
+        memcpy(bitmap->data, &msgp->bitmap.data, bitmap->size);
+    }
+
+    return 0;
+}
+
+int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
+                              struct vfio_region_info *info, VFIOUserFDs *fds)
+{
+    g_autofree struct vfio_user_region_info *msgp = NULL;
+    int size;
+
+    /* data returned can be larger than vfio_region_info */
+    if (info->argsz < sizeof(*info)) {
+        error_printf("vfio_user_get_region_info argsz too small\n");
+        return -EINVAL;
+    }
+    if (fds != NULL && fds->send_fds != 0) {
+        error_printf("vfio_user_get_region_info can't send FDs\n");
+        return -EINVAL;
+    }
+
+    size = info->argsz + sizeof(vfio_user_hdr_t);
+    msgp = g_malloc0(size);
+
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_GET_REGION_INFO,
+                          sizeof(*msgp), 0);
+    msgp->argsz = info->argsz;
+    msgp->index = info->index;
+
+    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, fds, size);
+    if (msgp->hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->hdr.error_reply;
+    }
+
+    memcpy(info, &msgp->argsz, info->argsz);
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC 12/19] vfio-user: probe remote device's BARs
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (10 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 11/19] vfio-user: get region and DMA map/unmap operations Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19 22:59   ` Alex Williamson
  2021-07-19  6:27 ` [PATCH RFC 13/19] vfio-user: respond to remote DMA read/write requests Elena Ufimtseva
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)
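
Note for readers (not part of the commit): the hunk below copies the raw BAR
value out of the locally cached config space and then converts it from
little-endian. A minimal sketch of that step, assuming the standard PCI BAR
layout (bit 0 selects I/O space, bits 2:1 give the memory type). The helper
names here are hypothetical and do not appear in the patch; the constants
mirror the usual PCI register definitions.

```c
#include <stdint.h>

/* Standard PCI config-space values, redefined locally for the sketch. */
#define BAR0_OFFSET        0x10u  /* same offset as PCI_BASE_ADDRESS_0 */
#define BAR_SPACE_IO       0x01u  /* bit 0 set: I/O BAR */
#define BAR_MEM_TYPE_MASK  0x06u  /* bits 2:1: memory BAR type */
#define BAR_MEM_TYPE_64    0x04u  /* 64-bit memory BAR */

/* Hypothetical helper: read BAR 'nr' (0..5) from a cached config copy. */
static uint32_t cached_bar_read(const uint8_t *config, int nr)
{
    const uint8_t *p = config + BAR0_OFFSET + 4 * nr;

    /*
     * Config space is little-endian. Assembling the bytes explicitly
     * gives the same result on any host byte order.
     */
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Non-zero if the BAR describes an I/O port range. */
static int bar_is_io(uint32_t bar)
{
    return (bar & BAR_SPACE_IO) != 0;
}

/* Non-zero if the BAR is a 64-bit memory BAR. */
static int bar_is_mem64(uint32_t bar)
{
    return !bar_is_io(bar) && (bar & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}
```

Reading from the cached copy avoids a round trip to the device process for
each BAR, which is presumably why the proxy path skips the pread().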

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 054e673552..a8d2e59470 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1619,11 +1619,17 @@ static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
     }
 
     /* Determine what type of BAR this is for registration */
-    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
-                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
-    if (ret != sizeof(pci_bar)) {
-        error_report("vfio: Failed to read BAR %d (%m)", nr);
-        return;
+    if (vdev->vbasedev.proxy != NULL) {
+        /* during setup, config space was initialized from remote */
+        memcpy(&pci_bar, vdev->pdev.config + PCI_BASE_ADDRESS_0 + (4 * nr),
+               sizeof(pci_bar));
+    } else {
+        ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                    vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+        if (ret != sizeof(pci_bar)) {
+            error_report("vfio: Failed to read BAR %d (%m)", nr);
+            return;
+        }
     }
 
     pci_bar = le32_to_cpu(pci_bar);
@@ -3442,6 +3448,22 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     /* QEMU can also add or extend BARs */
     memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
 
+    /*
+     * Local QEMU overrides aren't allowed
+     * They must be done in the device process
+     */
+    if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        error_setg(errp, "Multi-function must be specified by device process");
+        goto error;
+    }
+    if (pdev->romfile) {
+        error_setg(errp, "Romfile must be specified by device process");
+        goto error;
+    }
+
+    vfio_bars_prepare(vdev);
+
+
     return;
 
  error:
-- 
2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC 13/19] vfio-user: respond to remote DMA read/write requests
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (11 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 12/19] vfio-user: probe remote device's BARs Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 14/19] vfio_user: setup MSI/X interrupts and PCI config operations Elena Ufimtseva
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h | 16 ++++++++++++
 hw/vfio/pci.c  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/user.c | 21 +++++++++++++++-
 3 files changed, 103 insertions(+), 1 deletion(-)
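
Note for readers (not part of the commit): the DMA write handler below checks
the transfer count against the message size with an unsigned subtraction.
That subtraction could wrap if a peer sent a header size smaller than the
fixed part of the message, unless such messages are already rejected earlier
in the receive path. A minimal sketch of an order of checks that avoids the
wrap, under stated assumptions: the struct and the function name are
hypothetical stand-ins, and the 16-byte header is only an illustrative size.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed part of a DMA read/write message. */
struct dma_rw_fixed {
    uint8_t  hdr[16];   /* assumed header size, for illustration only */
    uint64_t offset;
    uint32_t count;
};

/*
 * Return 0 if 'count' payload bytes fit in a message of 'msg_size' total
 * bytes, or a negative errno value otherwise. Testing msg_size first keeps
 * the unsigned subtraction from wrapping when the message is shorter than
 * its own fixed part.
 */
static int dma_write_check(uint32_t msg_size, uint32_t count)
{
    const size_t fixed = sizeof(struct dma_rw_fixed);

    if (msg_size < fixed) {
        return -EINVAL;
    }
    if (count > msg_size - fixed) {
        return -E2BIG;
    }
    return 0;
}
```

The same idea applies to the read path, where the reply size is computed from
the requested count before the count is compared against the transfer limit.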

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 351fdb3ee1..d08d94ed92 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -206,6 +206,17 @@ struct vfio_user_region_info {
     uint64_t offset;
 };
 
+/*
+ * VFIO_USER_DMA_READ
+ * VFIO_USER_DMA_WRITE
+ */
+struct vfio_user_dma_rw {
+    vfio_user_hdr_t hdr;
+    uint64_t offset;
+    uint32_t count;
+    char data[];
+};
+
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
@@ -224,4 +235,9 @@ int vfio_user_dma_unmap(VFIOProxy *proxy,
                         struct vfio_bitmap *bitmap);
 int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
                               struct vfio_region_info *info, VFIOUserFDs *fds);
+uint64_t vfio_user_max_xfer(void);
+void vfio_user_set_reqhandler(VFIODevice *vbasdev,
+                              int (*handler)(void *opaque, char *buf,
+                                             VFIOUserFDs *fds),
+                                             void *reqarg);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a8d2e59470..7042c178dd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3347,6 +3347,72 @@ static void register_vfio_pci_dev_type(void)
 
 type_init(register_vfio_pci_dev_type)
 
+static int vfio_user_dma_read(VFIOPCIDevice *vdev, struct vfio_user_dma_rw *msg)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    char *buf;
+    int size = msg->count + sizeof(struct vfio_user_dma_rw);
+
+    if (msg->hdr.flags & VFIO_USER_NO_REPLY) {
+        return -EINVAL;
+    }
+    if (msg->count > vfio_user_max_xfer()) {
+        return -E2BIG;
+    }
+
+    buf = g_malloc0(size);
+    memcpy(buf, msg, sizeof(*msg));
+
+    pci_dma_read(pdev, msg->offset, buf + sizeof(*msg), msg->count);
+
+    vfio_user_send_reply(vdev->vbasedev.proxy, buf, size);
+    g_free(buf);
+    return 0;
+}
+
+static int vfio_user_dma_write(VFIOPCIDevice *vdev,
+                               struct vfio_user_dma_rw *msg)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    char *buf = (char *)msg + sizeof(*msg);
+
+    /* make sure transfer count isn't larger than the message data */
+    if (msg->count > msg->hdr.size - sizeof(*msg)) {
+        return -E2BIG;
+    }
+
+    pci_dma_write(pdev, msg->offset, buf, msg->count);
+
+    if ((msg->hdr.flags & VFIO_USER_NO_REPLY) == 0) {
+        vfio_user_send_reply(vdev->vbasedev.proxy, (char *)msg,
+                             sizeof(msg->hdr));
+    }
+    return 0;
+}
+
+static int vfio_user_pci_process_req(void *opaque, char *buf, VFIOUserFDs *fds)
+{
+    VFIOPCIDevice *vdev = opaque;
+    vfio_user_hdr_t *hdr = (vfio_user_hdr_t *)buf;
+    int ret;
+
+    if (fds->recv_fds != 0) {
+        return -EINVAL;
+    }
+    switch (hdr->command) {
+    case VFIO_USER_DMA_READ:
+        ret = vfio_user_dma_read(vdev, (struct vfio_user_dma_rw *)hdr);
+        break;
+    case VFIO_USER_DMA_WRITE:
+        ret = vfio_user_dma_write(vdev, (struct vfio_user_dma_rw *)hdr);
+        break;
+    default:
+        error_printf("vfio_user_process_req unknown cmd %d\n", hdr->command);
+        ret = -ENOSYS;
+    }
+    return ret;
+}
+
 /*
  * Emulated devices don't use host hot reset
  */
@@ -3392,6 +3458,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         return;
     }
     vbasedev->proxy = proxy;
+    vfio_user_set_reqhandler(vbasedev, vfio_user_pci_process_req, vdev);
 
     if (udev->secure) {
         proxy->flags |= VFIO_PROXY_SECURE;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index eea8b9b402..8bedbc19f3 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -42,6 +42,11 @@ static void vfio_user_request_msg(vfio_user_hdr_t *hdr, uint16_t cmd,
 static void vfio_user_send_recv(VFIOProxy *proxy, vfio_user_hdr_t *msg,
                                 VFIOUserFDs *fds, int rsize);
 
+uint64_t vfio_user_max_xfer(void)
+{
+    return max_xfer_size;
+}
+
 static void vfio_user_shutdown(VFIOProxy *proxy)
 {
     qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
@@ -236,7 +241,7 @@ void vfio_user_recv(void *opaque)
         *reply->msg = msg;
         data = (char *)reply->msg + sizeof(msg);
     } else {
-        if (msg.size > max_xfer_size) {
+        if (msg.size > max_xfer_size + sizeof(struct vfio_user_dma_rw)) {
             error_setg(&local_err, "vfio_user_recv request larger than max");
             goto fatal;
         }
@@ -779,3 +784,17 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
     memcpy(info, &msgp->argsz, info->argsz);
     return 0;
 }
+
+void vfio_user_set_reqhandler(VFIODevice *vbasedev,
+                              int (*handler)(void *opaque, char *buf,
+                                             VFIOUserFDs *fds),
+                              void *reqarg)
+{
+    VFIOProxy *proxy = vbasedev->proxy;
+
+    proxy->request = handler;
+    proxy->reqarg = reqarg;
+    qio_channel_set_aio_fd_handler(proxy->ioc,
+                                   iothread_get_aio_context(vfio_user_iothread),
+                                   vfio_user_recv, NULL, vbasedev);
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC 14/19] vfio_user: setup MSI/X interrupts and PCI config operations
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (12 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 13/19] vfio-user: respond to remote DMA read/write requests Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 15/19] vfio-user: vfio user device realize Elena Ufimtseva
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send VFIO_USER_DEVICE_SET_IRQS to set up the interrupt configuration.
vfio_pci_write_config/vfio_pci_read_config inform the remote
server of PCI config space reads and writes.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h   | 14 ++++++++
 hw/vfio/common.c | 26 ++++++++++++---
 hw/vfio/pci.c    | 71 ++++++++++++++++++++++++++++-----------
 hw/vfio/user.c   | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 173 insertions(+), 25 deletions(-)
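
Note for readers (not part of the commit): vfio_user_set_irqs() below sends
eventfds in chunks, where each message carries either only valid FDs or only
invalid (-1) entries, and no more than the per-message FD limit. A minimal
sketch of that run-splitting logic; the function names are hypothetical and
the per-message limit is passed in as a parameter rather than read from a
global.

```c
#include <stddef.h>

/*
 * Length of the run starting at fds[cur] in which every entry is valid
 * (!= -1) or every entry is invalid (== -1), matching the first entry.
 * The run is capped at 'remaining' entries and at 'max_per_msg'.
 * Both 'remaining' and 'max_per_msg' must be at least 1.
 */
static size_t fd_run_length(const int *fds, size_t cur, size_t remaining,
                            size_t max_per_msg)
{
    int first_valid = fds[cur] != -1;
    size_t n = 1;

    while (n < remaining && n < max_per_msg &&
           (fds[cur + n] != -1) == first_valid) {
        n++;
    }
    return n;
}

/* Number of messages needed to cover all 'nfds' entries. */
static size_t fd_count_messages(const int *fds, size_t nfds,
                                size_t max_per_msg)
{
    size_t sent = 0, msgs = 0;

    while (sent < nfds) {
        sent += fd_run_length(fds, sent, nfds - sent, max_per_msg);
        msgs++;
    }
    return msgs;
}
```

For example, the array { 3, 4, -1, -1, -1, 7 } splits into runs of 2, 3 and 1
entries when the limit is large, and into four messages when the limit is 2.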

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index d08d94ed92..afb85952da 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -217,6 +217,19 @@ struct vfio_user_dma_rw {
     char data[];
 };
 
+/*
+ * VFIO_USER_DEVICE_SET_IRQS
+ * imported from struct vfio_irq_set
+ */
+struct vfio_user_irq_set {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t index;
+    uint32_t start;
+    uint32_t count;
+};
+
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
@@ -240,4 +253,5 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                               int (*handler)(void *opaque, char *buf,
                                              VFIOUserFDs *fds),
                                              void *reqarg);
+int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 52a092e168..9b68416599 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -71,7 +71,11 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
         .count = 0,
     };
 
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    if (vbasedev->proxy != NULL) {
+        vfio_user_set_irqs(vbasedev, &irq_set);
+    } else {
+        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    }
 }
 
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
@@ -84,7 +88,11 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
         .count = 1,
     };
 
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    if (vbasedev->proxy != NULL) {
+        vfio_user_set_irqs(vbasedev, &irq_set);
+    } else {
+        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    }
 }
 
 void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
@@ -97,7 +105,11 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
         .count = 1,
     };
 
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    if (vbasedev->proxy != NULL) {
+        vfio_user_set_irqs(vbasedev, &irq_set);
+    } else {
+        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    }
 }
 
 static inline const char *action_to_str(int action)
@@ -178,8 +190,12 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
     pfd = (int32_t *)&irq_set->data;
     *pfd = fd;
 
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
-        ret = -errno;
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_set_irqs(vbasedev, irq_set);
+    } else {
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
+            ret = -errno;
+        }
     }
     g_free(irq_set);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7042c178dd..3362e8f3f5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -403,7 +403,11 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
         fds[i] = fd;
     }
 
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (vdev->vbasedev.proxy != NULL) {
+        ret = vfio_user_set_irqs(&vdev->vbasedev, irq_set);
+    } else {
+        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    }
 
     g_free(irq_set);
 
@@ -1123,8 +1127,14 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
     if (~emu_bits & (0xffffffffU >> (32 - len * 8))) {
         ssize_t ret;
 
-        ret = pread(vdev->vbasedev.fd, &phys_val, len,
-                    vdev->config_offset + addr);
+        if (vdev->vbasedev.proxy != NULL) {
+            ret = vfio_user_region_read(&vdev->vbasedev,
+                                        VFIO_PCI_CONFIG_REGION_INDEX,
+                                        addr, len, &phys_val);
+        } else {
+            ret = pread(vdev->vbasedev.fd, &phys_val, len,
+                        vdev->config_offset + addr);
+        }
         if (ret != len) {
             error_report("%s(%s, 0x%x, 0x%x) failed: %m",
                          __func__, vdev->vbasedev.name, addr, len);
@@ -1145,12 +1155,20 @@ void vfio_pci_write_config(PCIDevice *pdev,
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     uint32_t val_le = cpu_to_le32(val);
+    int ret;
 
     trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
 
     /* Write everything to VFIO, let it filter out what we can't write */
-    if (pwrite(vdev->vbasedev.fd, &val_le, len, vdev->config_offset + addr)
-                != len) {
+    if (vdev->vbasedev.proxy != NULL) {
+        ret = vfio_user_region_write(&vdev->vbasedev,
+                                     VFIO_PCI_CONFIG_REGION_INDEX,
+                                     addr, len, &val_le);
+    } else {
+        ret = pwrite(vdev->vbasedev.fd, &val_le, len,
+                     vdev->config_offset + addr);
+    }
+    if (ret != len) {
         error_report("%s(%s, 0x%x, 0x%x, 0x%x) failed: %m",
                      __func__, vdev->vbasedev.name, addr, val, len);
     }
@@ -1175,7 +1193,7 @@ void vfio_pci_write_config(PCIDevice *pdev,
                 vfio_update_msi(vdev);
             }
         }
-    } else if (pdev->cap_present & QEMU_PCI_CAP_MSIX &&
+  } else if (pdev->cap_present & QEMU_PCI_CAP_MSIX &&
         ranges_overlap(addr, len, pdev->msix_cap, MSIX_CAP_LENGTH)) {
         int is_enabled, was_enabled = msix_enabled(pdev);
 
@@ -1456,22 +1474,30 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
-    if (pread(fd, &ctrl, sizeof(ctrl),
-              vdev->config_offset + pos + PCI_MSIX_FLAGS) != sizeof(ctrl)) {
-        error_setg_errno(errp, errno, "failed to read PCI MSIX FLAGS");
-        return;
-    }
+    if (vdev->vbasedev.proxy != NULL) {
+        /* during setup, config space was initialized from remote */
+        memcpy(&ctrl, vdev->pdev.config + pos + PCI_MSIX_FLAGS, sizeof(ctrl));
+        memcpy(&table, vdev->pdev.config + pos + PCI_MSIX_TABLE, sizeof(table));
+        memcpy(&pba, vdev->pdev.config + pos + PCI_MSIX_PBA, sizeof(pba));
+    } else {
+        if (pread(fd, &ctrl, sizeof(ctrl),
+                  vdev->config_offset + pos + PCI_MSIX_FLAGS) != sizeof(ctrl)) {
+            error_setg_errno(errp, errno, "failed to read PCI MSIX FLAGS");
+            return;
+        }
 
-    if (pread(fd, &table, sizeof(table),
-              vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) {
-        error_setg_errno(errp, errno, "failed to read PCI MSIX TABLE");
-        return;
-    }
+        if (pread(fd, &table, sizeof(table),
+                  vdev->config_offset + pos +
+                  PCI_MSIX_TABLE) != sizeof(table)) {
+            error_setg_errno(errp, errno, "failed to read PCI MSIX TABLE");
+            return;
+        }
 
-    if (pread(fd, &pba, sizeof(pba),
-              vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
-        error_setg_errno(errp, errno, "failed to read PCI MSIX PBA");
-        return;
+        if (pread(fd, &pba, sizeof(pba),
+                  vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
+            error_setg_errno(errp, errno, "failed to read PCI MSIX PBA");
+            return;
+        }
     }
 
     ctrl = le16_to_cpu(ctrl);
@@ -3530,6 +3556,11 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
 
     vfio_bars_prepare(vdev);
 
+    vfio_msix_early_setup(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        goto error;
+    }
 
     return;
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 8bedbc19f3..6afbde8ba8 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -798,3 +798,90 @@ void vfio_user_set_reqhandler(VFIODevice *vbasedev,
                                    iothread_get_aio_context(vfio_user_iothread),
                                    vfio_user_recv, NULL, vbasedev);
 }
+
+static int irq_howmany(int *fdp, int cur, int max)
+{
+    int n = 0;
+
+    if (fdp[cur] != -1) {
+        do {
+            n++;
+        } while (n < max && fdp[cur + n] != -1 && n < max_send_fds);
+    } else {
+        do {
+            n++;
+        } while (n < max && fdp[cur + n] == -1 && n < max_send_fds);
+    }
+
+    return n;
+}
+
+int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq)
+{
+    g_autofree struct vfio_user_irq_set *msgp = NULL;
+    uint32_t size, nfds, send_fds, sent_fds;
+
+    if (irq->argsz < sizeof(*irq)) {
+        error_printf("vfio_user_set_irqs argsz too small\n");
+        return -EINVAL;
+    }
+
+    /*
+     * Handle simple case
+     */
+    if ((irq->flags & VFIO_IRQ_SET_DATA_EVENTFD) == 0) {
+        size = sizeof(vfio_user_hdr_t) + irq->argsz;
+        msgp = g_malloc0(size);
+
+        vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_SET_IRQS, size, 0);
+        msgp->argsz = irq->argsz;
+        msgp->flags = irq->flags;
+        msgp->index = irq->index;
+        msgp->start = irq->start;
+        msgp->count = irq->count;
+
+        vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0);
+        if (msgp->hdr.flags & VFIO_USER_ERROR) {
+            return -msgp->hdr.error_reply;
+        }
+
+        return 0;
+    }
+
+    /*
+     * Calculate the number of FDs to send
+     * and adjust argsz
+     */
+    nfds = (irq->argsz - sizeof(*irq)) / sizeof(int);
+    irq->argsz = sizeof(*irq);
+    msgp = g_malloc0(sizeof(*msgp));
+    /*
+     * Send in chunks if over max_send_fds
+     */
+    for (sent_fds = 0; nfds > sent_fds; sent_fds += send_fds) {
+        VFIOUserFDs *arg_fds, loop_fds;
+
+        /* must send all valid FDs or all invalid FDs in single msg */
+        send_fds = irq_howmany((int *)irq->data, sent_fds, nfds - sent_fds);
+
+        vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_SET_IRQS,
+                              sizeof(*msgp), 0);
+        msgp->argsz = irq->argsz;
+        msgp->flags = irq->flags;
+        msgp->index = irq->index;
+        msgp->start = irq->start + sent_fds;
+        msgp->count = send_fds;
+
+        loop_fds.send_fds = send_fds;
+        loop_fds.recv_fds = 0;
+        loop_fds.fds = (int *)irq->data + sent_fds;
+        arg_fds = loop_fds.fds[0] != -1 ? &loop_fds : NULL;
+
+        vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, arg_fds, 0);
+        if (msgp->hdr.flags & VFIO_USER_ERROR) {
+            return -msgp->hdr.error_reply;
+        }
+    }
+
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC 15/19] vfio-user: vfio user device realize
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (13 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 14/19] vfio_user: setup MSI/X interrupts and PCI config operations Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 16/19] vfio-user: pci reset Elena Ufimtseva
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Set up INTx interrupts and a region info cache
for the remote device.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 include/hw/vfio/vfio-common.h |  1 +
 hw/vfio/common.c              | 33 ++++++++++++++++++-
 hw/vfio/pci.c                 | 61 ++++++++++++++++++++++++++++++++---
 hw/vfio/user.c                | 20 ++++++++++++
 4 files changed, 109 insertions(+), 6 deletions(-)
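
Note for readers (not part of the commit): the common.c changes below rely on
a per-device cache of region info and region FDs that is filled on first use
and released in vfio_put_base_device(). A minimal sketch of such a
fill-on-miss cache, under stated assumptions: every type and function name
here is hypothetical, the region description is trimmed down, and
demo_fetch() stands in for the round trip to the device process.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down region description. */
struct region_info {
    uint32_t index;
    uint64_t size;
};

/* Hypothetical fetch hook: fills 'out' and '*fd' (-1 if none); 0 on success. */
typedef int (*region_fetch_fn)(uint32_t index, struct region_info *out,
                               int *fd);

struct region_cache {
    uint32_t nr;                /* number of regions; must be > 0 */
    struct region_info **info;  /* NULL until the entry is fetched */
    int *fds;                   /* -1 when no FD was exported */
    region_fetch_fn fetch;
};

static int region_cache_init(struct region_cache *c, uint32_t nr,
                             region_fetch_fn fetch)
{
    c->nr = nr;
    c->fetch = fetch;
    c->info = calloc(nr, sizeof(*c->info));
    c->fds = malloc(nr * sizeof(*c->fds));
    if (c->info == NULL || c->fds == NULL) {
        free(c->info);
        free(c->fds);
        c->info = NULL;
        c->fds = NULL;
        return -1;
    }
    for (uint32_t i = 0; i < nr; i++) {
        c->fds[i] = -1;
    }
    return 0;
}

/* Return the cached entry, fetching it on first use; NULL on error. */
static const struct region_info *region_cache_get(struct region_cache *c,
                                                  uint32_t index)
{
    if (index >= c->nr) {
        return NULL;
    }
    if (c->info[index] == NULL) {
        struct region_info *ri = calloc(1, sizeof(*ri));

        if (ri == NULL) {
            return NULL;
        }
        if (c->fetch(index, ri, &c->fds[index]) != 0) {
            free(ri);
            c->fds[index] = -1;
            return NULL;
        }
        c->info[index] = ri;
    }
    return c->info[index];
}

/* Free cached entries. This sketch does not close any FDs. */
static void region_cache_free(struct region_cache *c)
{
    for (uint32_t i = 0; i < c->nr; i++) {
        free(c->info[i]);
    }
    free(c->info);
    free(c->fds);
    c->info = NULL;
    c->fds = NULL;
}

/* Stand-in fetch for demonstration; counts how often it is called. */
static int demo_fetch_calls;

static int demo_fetch(uint32_t index, struct region_info *out, int *fd)
{
    demo_fetch_calls++;
    out->index = index;
    out->size = 4096u * (index + 1);
    *fd = -1;
    return 0;
}
```

A second lookup of the same index returns the stored pointer without calling
the fetch hook again, which is the property the remfd lookup depends on.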

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d7b717594b..688660c28d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -56,6 +56,7 @@ typedef struct VFIORegion {
     uint32_t nr_mmaps;
     VFIOMmap *mmaps;
     uint8_t nr; /* cache the region number for debug */
+    int remfd; /* fd if exported from remote process */
 } VFIORegion;
 
 typedef struct VFIOMigration {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9b68416599..953d9e7b55 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1571,6 +1571,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
     return true;
 }
 
+static int vfio_get_region_info_remfd(VFIODevice *vbasedev, int index)
+{
+    struct vfio_region_info *info;
+
+    if (vbasedev->regions == NULL || vbasedev->regions[index] == NULL) {
+        vfio_get_region_info(vbasedev, index, &info);
+    }
+    return vbasedev->regfds != NULL ? vbasedev->regfds[index] : -1;
+}
+
 static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
                                           struct vfio_region_info *info)
 {
@@ -1624,6 +1634,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
     region->size = info->size;
     region->fd_offset = info->offset;
     region->nr = index;
+    region->remfd = vfio_get_region_info_remfd(vbasedev, index);
 
     if (region->size) {
         region->mem = g_new0(MemoryRegion, 1);
@@ -1667,6 +1678,7 @@ int vfio_region_mmap(VFIORegion *region)
 {
     int i, prot = 0;
     char *name;
+    int fd;
 
     if (!region->mem) {
         return 0;
@@ -1675,9 +1687,11 @@ int vfio_region_mmap(VFIORegion *region)
     prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
     prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
+    fd = region->remfd != -1 ? region->remfd : region->vbasedev->fd;
+
     for (i = 0; i < region->nr_mmaps; i++) {
         region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
-                                     MAP_SHARED, region->vbasedev->fd,
+                                     MAP_SHARED, fd,
                                      region->fd_offset +
                                      region->mmaps[i].offset);
         if (region->mmaps[i].mmap == MAP_FAILED) {
@@ -2524,6 +2538,23 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 
 void vfio_put_base_device(VFIODevice *vbasedev)
 {
+    if (vbasedev->regions != NULL) {
+        int i;
+
+        for (i = 0; i < vbasedev->num_regions; i++) {
+            if (vbasedev->regfds != NULL && vbasedev->regfds[i] != -1) {
+                close(vbasedev->regfds[i]);
+            }
+            g_free(vbasedev->regions[i]);
+        }
+        g_free(vbasedev->regions);
+        vbasedev->regions = NULL;
+        if (vbasedev->regfds != NULL) {
+            g_free(vbasedev->regfds);
+            vbasedev->regfds = NULL;
+        }
+    }
+
     if (!vbasedev->group) {
         return;
     }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 3362e8f3f5..52af5a1061 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -256,11 +256,16 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
 
 static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
 {
-    uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    uint8_t pin;
     Error *err = NULL;
     int32_t fd;
     int ret;
 
+    if (vdev->vbasedev.proxy != NULL) {
+        pin = vdev->pdev.config[PCI_INTERRUPT_PIN];
+    } else {
+        pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    }
 
     if (!pin) {
         return 0;
@@ -1258,10 +1263,15 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
     int ret, entries;
     Error *err = NULL;
 
-    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
-              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
-        error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
-        return -errno;
+    if (vdev->vbasedev.proxy != NULL) {
+        /* during setup, config space was initialized from remote */
+        memcpy(&ctrl, vdev->pdev.config + pos + PCI_CAP_FLAGS, sizeof(ctrl));
+    } else {
+        if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+                  vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+            error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
+            return -errno;
+        }
     }
     ctrl = le16_to_cpu(ctrl);
 
@@ -3562,9 +3572,50 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    vfio_bars_register(vdev);
+
+    ret = vfio_add_capabilities(vdev, errp);
+    if (ret) {
+        goto out_teardown;
+    }
+
+    /* QEMU emulates all of MSI & MSIX */
+    if (pdev->cap_present & QEMU_PCI_CAP_MSIX) {
+        memset(vdev->emulated_config_bits + pdev->msix_cap, 0xff,
+               MSIX_CAP_LENGTH);
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI) {
+        memset(vdev->emulated_config_bits + pdev->msi_cap, 0xff,
+               vdev->msi_cap_size);
+    }
+
+    if (vdev->pdev.config[PCI_INTERRUPT_PIN] != 0) {
+        vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                             vfio_intx_mmap_enable, vdev);
+        pci_device_set_intx_routing_notifier(&vdev->pdev,
+                                             vfio_intx_routing_notifier);
+        vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
+        kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
+        ret = vfio_intx_enable(vdev, errp);
+        if (ret) {
+            goto out_deregister;
+        }
+    }
+
+    vfio_register_err_notifier(vdev);
+    vfio_register_req_notifier(vdev);
+
     return;
 
+out_deregister:
+    pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
+    kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
+out_teardown:
+    vfio_teardown_msi(vdev);
+    vfio_bars_exit(vdev);
  error:
+    vfio_user_disconnect(proxy);
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
 }
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 6afbde8ba8..0fd7e01986 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -574,6 +574,16 @@ VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp)
     return proxy;
 }
 
+static void vfio_user_cb(void *opaque)
+{
+    VFIOProxy *proxy = opaque;
+
+    qemu_mutex_lock(&proxy->lock);
+    proxy->state = CLOSED;
+    qemu_mutex_unlock(&proxy->lock);
+    qemu_cond_signal(&proxy->close_cv);
+}
+
 void vfio_user_disconnect(VFIOProxy *proxy)
 {
     VFIOUserReply *r1, *r2;
@@ -601,6 +611,16 @@ void vfio_user_disconnect(VFIOProxy *proxy)
         g_free(r1);
     }
 
+    /*
+     * Make sure the iothread isn't blocking anywhere
+     * with a ref to this proxy by waiting for a BH
+     * handler to run after the proxy fd handlers were
+     * deleted above.
+     */
+    proxy->close_wait = 1;
+    aio_bh_schedule_oneshot(iothread_get_aio_context(vfio_user_iothread),
+                            vfio_user_cb, proxy);
+
     /* drop locks so the iothread can make progress */
     qemu_mutex_unlock_iothread();
     qemu_cond_wait(&proxy->close_cv, &proxy->lock);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC 16/19] vfio-user: pci reset
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (14 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 15/19] vfio-user: vfio user device realize Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 17/19] vfio-user: probe remote device ROM BAR Elena Ufimtseva
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send VFIO_USER_DEVICE_RESET to reset the remote device.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h |  1 +
 hw/vfio/pci.c  | 29 ++++++++++++++++++++++++++---
 hw/vfio/user.c | 12 ++++++++++++
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index afb85952da..95c2fb1707 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -254,4 +254,5 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                                              VFIOUserFDs *fds),
                                              void *reqarg);
 int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
+void vfio_user_reset(VFIODevice *vbasedev);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 52af5a1061..a6c28dac03 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2212,8 +2212,9 @@ static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
 
 static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
 {
+    VFIODevice *vbasedev = &vdev->vbasedev;
     Error *err = NULL;
-    int nr;
+    int ret, nr;
 
     vfio_intx_enable(vdev, &err);
     if (err) {
@@ -2221,11 +2222,18 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
     }
 
     for (nr = 0; nr < PCI_NUM_REGIONS - 1; ++nr) {
-        off_t addr = vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr);
+        off_t addr = PCI_BASE_ADDRESS_0 + (4 * nr);
         uint32_t val = 0;
         uint32_t len = sizeof(val);
 
-        if (pwrite(vdev->vbasedev.fd, &val, len, addr) != len) {
+        if (vbasedev->proxy != NULL) {
+            ret = vfio_user_region_write(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
+                                         addr, len, &val);
+        } else {
+            ret = pwrite(vdev->vbasedev.fd, &val, len,
+                         vdev->config_offset + addr);
+        }
+        if (ret != len) {
             error_report("%s(%s) reset bar %d failed: %m", __func__,
                          vdev->vbasedev.name, nr);
         }
@@ -3634,6 +3642,20 @@ static void vfio_user_instance_finalize(Object *obj)
     vfio_user_disconnect(vbasedev->proxy);
 }
 
+static void vfio_user_pci_reset(DeviceState *dev)
+{
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    vfio_pci_pre_reset(vdev);
+
+    if (vbasedev->reset_works) {
+        vfio_user_reset(vbasedev);
+    }
+
+    vfio_pci_post_reset(vdev);
+}
+
 static Property vfio_user_pci_dev_properties[] = {
     DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
     DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure, false),
@@ -3645,6 +3667,7 @@ static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
     DeviceClass *dc = DEVICE_CLASS(klass);
     PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
 
+    dc->reset = vfio_user_pci_reset;
     device_class_set_props(dc, vfio_user_pci_dev_properties);
     dc->desc = "VFIO over socket PCI device assignment";
     pdc->realize = vfio_user_pci_realize;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 0fd7e01986..8917596a2f 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -905,3 +905,15 @@ int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq)
 
     return 0;
 }
+
+void vfio_user_reset(VFIODevice *vbasedev)
+{
+    vfio_user_hdr_t msg;
+
+    vfio_user_request_msg(&msg, VFIO_USER_DEVICE_RESET, sizeof(msg), 0);
+
+    vfio_user_send_recv(vbasedev->proxy, &msg, NULL, 0);
+    if (msg.flags & VFIO_USER_ERROR) {
+        error_printf("reset reply error %d\n", msg.error_reply);
+    }
+}
-- 
2.25.1
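
Note on the offset change in vfio_pci_post_reset() above: the loop clears each
BAR register, and the patch makes the offset region-relative so that the
vfio-user path can pass it with VFIO_PCI_CONFIG_REGION_INDEX, while the kernel
path still adds vdev->config_offset before pwrite(). The following is only a
minimal sketch of that arithmetic, not code from the patch. The helper names
bar_config_offset() and transport_offset() and the fd offset used in the usage
note are hypothetical; 0x10 is the standard config-space offset of BAR0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BASE_ADDRESS_0 0x10  /* standard config-space offset of BAR0 */
#define NUM_BARS           6     /* BAR0..BAR5, the range the reset loop covers */

/* Config-space offset of BAR number nr (hypothetical helper). */
static uint64_t bar_config_offset(int nr)
{
    assert(nr >= 0 && nr < NUM_BARS);
    return PCI_BASE_ADDRESS_0 + 4 * (uint64_t)nr;
}

/*
 * Offset handed to the transport (hypothetical helper).
 * Socket path: the offset is relative to the start of the config region.
 * Kernel path: the config region's offset within the device fd is added.
 */
static uint64_t transport_offset(bool via_proxy, uint64_t config_fd_offset,
                                 uint64_t region_off)
{
    return via_proxy ? region_off : config_fd_offset + region_off;
}
```

With an assumed config_fd_offset of 0x70000000000, BAR5 would be written at
region offset 0x24 over the socket and at fd offset 0x70000000024 through
pwrite().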




* [PATCH RFC 17/19] vfio-user: probe remote device ROM BAR
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (15 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 16/19] vfio-user: pci reset Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 18/19] vfio-user: migration support Elena Ufimtseva
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a6c28dac03..bed8eaa4c2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -816,8 +816,14 @@ static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
     memset(vdev->rom, 0xff, size);
 
     while (size) {
-        bytes = pread(vdev->vbasedev.fd, vdev->rom + off,
-                      size, vdev->rom_offset + off);
+        if (vdev->vbasedev.proxy != NULL) {
+            bytes = vfio_user_region_read(&vdev->vbasedev,
+                                          VFIO_PCI_ROM_REGION_INDEX,
+                                          off, size, vdev->rom + off);
+        } else {
+            bytes = pread(vdev->vbasedev.fd, vdev->rom + off,
+                          size, vdev->rom_offset + off);
+        }
         if (bytes == 0) {
             break;
         } else if (bytes > 0) {
@@ -936,12 +942,28 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
      * Use the same size ROM BAR as the physical device.  The contents
      * will get filled in later when the guest tries to read it.
      */
-    if (pread(fd, &orig, 4, offset) != 4 ||
-        pwrite(fd, &size, 4, offset) != 4 ||
-        pread(fd, &size, 4, offset) != 4 ||
-        pwrite(fd, &orig, 4, offset) != 4) {
-        error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
-        return;
+    if (vdev->vbasedev.proxy != NULL) {
+        if (vfio_user_region_read(&vdev->vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
+                                  PCI_ROM_ADDRESS, 4, &orig) != 4 ||
+            vfio_user_region_write(&vdev->vbasedev,
+                                   VFIO_PCI_CONFIG_REGION_INDEX,
+                                   PCI_ROM_ADDRESS, 4, &size) != 4 ||
+            vfio_user_region_read(&vdev->vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
+                                  PCI_ROM_ADDRESS, 4, &size) != 4 ||
+            vfio_user_region_write(&vdev->vbasedev,
+                                   VFIO_PCI_CONFIG_REGION_INDEX,
+                                   PCI_ROM_ADDRESS, 4, &orig) != 4) {
+            error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
+            return;
+        }
+    } else {
+        if (pread(fd, &orig, 4, offset) != 4 ||
+            pwrite(fd, &size, 4, offset) != 4 ||
+            pread(fd, &size, 4, offset) != 4 ||
+            pwrite(fd, &orig, 4, offset) != 4) {
+            error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
+            return;
+        }
     }
 
     size = ~(le32_to_cpu(size) & PCI_ROM_ADDRESS_MASK) + 1;
-- 
2.25.1
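
Note on the ROM BAR probe in vfio_pci_size_rom() above: both branches save the
register, write the address mask, read the value back, and restore the
original; the size is then derived from the bits that stuck. The sketch below
shows only that final arithmetic on a host-endian readback value. It is an
illustration under the assumption that the mask is the usual
PCI_ROM_ADDRESS_MASK (bits 31:11); rom_bar_size() is a hypothetical name, not
a function in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same bits as PCI_ROM_ADDRESS_MASK: address bits 31:11 of the ROM BAR. */
#define ROM_ADDRESS_MASK 0xfffff800u

/*
 * readback: host-endian value read from the ROM BAR after the address
 * bits were written as ones.  Returns the ROM size in bytes, or 0 when
 * no address bit stuck (no ROM behind the BAR).
 */
static uint32_t rom_bar_size(uint32_t readback)
{
    uint32_t size = ~(readback & ROM_ADDRESS_MASK) + 1;

    /* The low 11 bits are masked off, so the result is a multiple of 2 KiB. */
    assert((size & 0x7ffu) == 0);
    return size;
}
```

A readback of 0xffff0000 corresponds to a 64 KiB ROM; the enable bit (bit 0)
does not affect the result because it is masked off first.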




* [PATCH RFC 18/19] vfio-user: migration support
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (16 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 17/19] vfio-user: probe remote device ROM BAR Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19  6:27 ` [PATCH RFC 19/19] vfio-user: add migration cli options and version negotiation Elena Ufimtseva
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Send migration region operations to the remote server.
Send VFIO_USER_DIRTY_PAGES to get the remote dirty bitmap.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h      | 17 +++++++++++++++
 hw/vfio/common.c    | 51 ++++++++++++++++++++++++++++++++++++---------
 hw/vfio/migration.c | 35 ++++++++++++++++++-------------
 hw/vfio/pci.c       |  7 +++++++
 hw/vfio/user.c      | 45 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 130 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 95c2fb1707..eeb328c0a9 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -230,6 +230,20 @@ struct vfio_user_irq_set {
     uint32_t count;
 };
 
+/* imported from struct vfio_iommu_type1_dirty_bitmap_get */
+struct vfio_user_bitmap_range {
+    uint64_t iova;
+    uint64_t size;
+    struct vfio_user_bitmap bitmap;
+};
+
+/* imported from struct vfio_iommu_type1_dirty_bitmap */
+struct vfio_user_dirty_pages {
+    vfio_user_hdr_t hdr;
+    uint32_t argsz;
+    uint32_t flags;
+};
+
 void vfio_user_recv(void *opaque);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 VFIOProxy *vfio_user_connect_dev(char *sockname, Error **errp);
@@ -255,4 +269,7 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                                              void *reqarg);
 int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
 void vfio_user_reset(VFIODevice *vbasedev);
+int vfio_user_dirty_bitmap(VFIOProxy *proxy,
+                           struct vfio_iommu_type1_dirty_bitmap *bitmap,
+                           struct vfio_iommu_type1_dirty_bitmap_get *range);
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 953d9e7b55..bd31731c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -460,7 +460,11 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
         goto unmap_exit;
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (container->proxy != NULL) {
+        ret = vfio_user_dma_unmap(container->proxy, unmap, bitmap);
+    } else {
+        ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    }
     if (!ret) {
         cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
                 iotlb->translated_addr, pages);
@@ -1278,10 +1282,19 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
         dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
-    if (ret) {
-        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
-                     dirty.flags, errno);
+    if (container->proxy != NULL) {
+        ret = vfio_user_dirty_bitmap(container->proxy, &dirty, NULL);
+        if (ret) {
+            error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                         dirty.flags, -ret);
+        }
+    } else {
+        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+        if (ret) {
+            error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                         dirty.flags, errno);
+            ret = -errno;
+        }
     }
 }
 
@@ -1331,7 +1344,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         goto err_out;
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (container->proxy != NULL) {
+        ret = vfio_user_dirty_bitmap(container->proxy, dbitmap, range);
+    } else {
+        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    }
     if (ret) {
         error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
                 " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
@@ -2282,6 +2299,12 @@ void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as)
     VFIOAddressSpace *space;
     VFIOContainer *container;
 
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
+
+    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+
     /*
      * try to mirror vfio_connect_container()
      * as much as possible
@@ -2292,18 +2315,26 @@ void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as)
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = -1;
+    QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
     container->proxy = proxy;
 
+    /*
+     * The proxy uses a SW IOMMU in lieu of the HW one
+     * used in the ioctl() version.  Use TYPE1 with the
+     * target's page size for maximum compatibility
+     */
     container->iommu_type = VFIO_TYPE1_IOMMU;
-    vfio_host_win_add(container, 0, (hwaddr)-1, 4096);
-    container->pgsizes = 4096;
+    vfio_host_win_add(container, 0, (hwaddr)-1, TARGET_PAGE_SIZE);
+    container->pgsizes = TARGET_PAGE_SIZE;
+
+    container->dirty_pages_supported = true;
+    container->max_dirty_bitmap_size = VFIO_USER_DEF_MAX_XFER;
+    container->dirty_pgsizes = TARGET_PAGE_SIZE;
 
     QLIST_INIT(&container->group_list);
     QLIST_INSERT_HEAD(&space->containers, container, next);
 
-    QLIST_INIT(&container->giommu_list);
-
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 82f654afb6..8005b1171a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -27,6 +27,7 @@
 #include "pci.h"
 #include "trace.h"
 #include "hw/hw.h"
+#include "user.h"
 
 /*
  * Flags to be used as unique delimiters for VFIO devices in the migration
@@ -49,10 +50,18 @@ static int64_t bytes_transferred;
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
                                   off_t off, bool iswrite)
 {
+    VFIORegion *region = &vbasedev->migration->region;
     int ret;
 
-    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
-                    pread(vbasedev->fd, val, count, off);
+    if (vbasedev->proxy != NULL) {
+        ret = iswrite ?
+            vfio_user_region_write(vbasedev, region->nr, off, count, val) :
+            vfio_user_region_read(vbasedev, region->nr, off, count, val);
+    } else {
+        off += region->fd_offset;
+        ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+                        pread(vbasedev->fd, val, count, off);
+    }
     if (ret < count) {
         error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
                      HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,
@@ -111,9 +120,7 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
                                     uint32_t value)
 {
     VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    off_t dev_state_off = region->fd_offset +
-                          VFIO_MIG_STRUCT_OFFSET(device_state);
+    off_t dev_state_off = VFIO_MIG_STRUCT_OFFSET(device_state);
     uint32_t device_state;
     int ret;
 
@@ -201,13 +208,13 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
     int ret;
 
     ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+                        VFIO_MIG_STRUCT_OFFSET(data_offset));
     if (ret < 0) {
         return ret;
     }
 
     ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+                        VFIO_MIG_STRUCT_OFFSET(data_size));
     if (ret < 0) {
         return ret;
     }
@@ -233,8 +240,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
             }
             buf_allocated = true;
 
-            ret = vfio_mig_read(vbasedev, buf, sec_size,
-                                region->fd_offset + data_offset);
+            ret = vfio_mig_read(vbasedev, buf, sec_size, data_offset);
             if (ret < 0) {
                 g_free(buf);
                 return ret;
@@ -269,7 +275,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
 
     do {
         ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+                            VFIO_MIG_STRUCT_OFFSET(data_offset));
         if (ret < 0) {
             return ret;
         }
@@ -309,8 +315,8 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
             qemu_get_buffer(f, buf, sec_size);
 
             if (buf_alloc) {
-                ret = vfio_mig_write(vbasedev, buf, sec_size,
-                        region->fd_offset + data_offset);
+
+                ret = vfio_mig_write(vbasedev, buf, sec_size, data_offset);
                 g_free(buf);
 
                 if (ret < 0) {
@@ -322,7 +328,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
         }
 
         ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+                             VFIO_MIG_STRUCT_OFFSET(data_size));
         if (ret < 0) {
             return ret;
         }
@@ -334,12 +340,11 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
 static int vfio_update_pending(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
     uint64_t pending_bytes = 0;
     int ret;
 
     ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
-                    region->fd_offset + VFIO_MIG_STRUCT_OFFSET(pending_bytes));
+                        VFIO_MIG_STRUCT_OFFSET(pending_bytes));
     if (ret < 0) {
         migration->pending_bytes = 0;
         return ret;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bed8eaa4c2..36f8524e7c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3633,6 +3633,13 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Migration disabled", vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 8917596a2f..eceaeeccea 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -917,3 +917,48 @@ void vfio_user_reset(VFIODevice *vbasedev)
         error_printf("reset reply error %d\n", msg.error_reply);
     }
 }
+
+int vfio_user_dirty_bitmap(VFIOProxy *proxy,
+                           struct vfio_iommu_type1_dirty_bitmap *cmd,
+                           struct vfio_iommu_type1_dirty_bitmap_get *dbitmap)
+{
+    g_autofree struct {
+        struct vfio_user_dirty_pages msg;
+        struct vfio_user_bitmap_range range;
+    } *msgp = NULL;
+    int msize, rsize;
+
+    /*
+     * If just the command is sent, the returned bitmap isn't needed.
+     * The bitmap structs are different from the ioctl() versions,
+     * ioctl() returns the bitmap in a local VA
+     */
+    if (dbitmap != NULL) {
+        msize = sizeof(*msgp);
+        rsize = msize + dbitmap->bitmap.size;
+        msgp = g_malloc0(rsize);
+        msgp->range.iova = dbitmap->iova;
+        msgp->range.size = dbitmap->size;
+        msgp->range.bitmap.pgsize = dbitmap->bitmap.pgsize;
+        msgp->range.bitmap.size = dbitmap->bitmap.size;
+    } else {
+        msize = rsize = sizeof(struct vfio_user_dirty_pages);
+        msgp = g_malloc0(rsize);
+    }
+
+    vfio_user_request_msg(&msgp->msg.hdr, VFIO_USER_DIRTY_PAGES, msize, 0);
+    msgp->msg.argsz = msize - sizeof(msgp->msg.hdr);
+    msgp->msg.flags = cmd->flags;
+
+    vfio_user_send_recv(proxy, &msgp->msg.hdr, NULL, rsize);
+    if (msgp->msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->msg.hdr.error_reply;
+    }
+
+    if (dbitmap != NULL) {
+        memcpy(dbitmap->bitmap.data, &msgp->range.bitmap.data,
+               dbitmap->bitmap.size);
+    }
+
+    return 0;
+}
-- 
2.25.1
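
Note on message sizing in vfio_user_dirty_bitmap() above: the request carries
the command plus one range descriptor (msize), and the reply buffer is sized
as msize plus the bitmap itself (rsize). The patch takes the bitmap size from
the caller. The sketch below shows one plausible way such a size is derived,
one bit per page rounded up to whole 64-bit words; that rounding is an
assumption for illustration, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for a dirty bitmap covering range_size bytes of IOVA space
 * at one bit per page, rounded up to a multiple of 64 bits (assumed rule).
 */
static uint64_t dirty_bitmap_bytes(uint64_t range_size, uint64_t pgsize)
{
    assert(pgsize != 0);

    uint64_t pages = (range_size + pgsize - 1) / pgsize;  /* round up */
    uint64_t bits = (pages + 63) / 64 * 64;               /* whole words */
    return bits / 8;
}

/* Reply buffer size: fixed request part followed by the bitmap payload. */
static uint64_t dirty_reply_bytes(uint64_t msize, uint64_t bitmap_bytes)
{
    return msize + bitmap_bytes;
}
```

For a 1 MiB range with 4096-byte pages this gives 256 pages and a 32-byte
bitmap; a single page still needs 8 bytes because of the word rounding.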




* [PATCH RFC 19/19] vfio-user: add migration cli options and version negotiation
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (17 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 18/19] vfio-user: migration support Elena Ufimtseva
@ 2021-07-19  6:27 ` Elena Ufimtseva
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
  19 siblings, 0 replies; 55+ messages in thread
From: Elena Ufimtseva @ 2021-07-19  6:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 hw/vfio/user.h |  4 ++++
 hw/vfio/pci.c  |  5 +++++
 hw/vfio/user.c | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index eeb328c0a9..5542aa1932 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -70,6 +70,10 @@ struct vfio_user_version {
 /* "capabilities" members */
 #define VFIO_USER_CAP_MAX_FDS   "max_msg_fds"
 #define VFIO_USER_CAP_MAX_XFER  "max_data_xfer_size"
+#define VFIO_USER_CAP_MIGR      "migration"
+
+/* "migration" member */
+#define VFIO_USER_CAP_PGSIZE    "pgsize"
 
 #define VFIO_USER_DEF_MAX_FDS   8
 #define VFIO_USER_MAX_MAX_FDS   16
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 36f8524e7c..2f97160147 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3688,6 +3688,11 @@ static void vfio_user_pci_reset(DeviceState *dev)
 static Property vfio_user_pci_dev_properties[] = {
     DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
     DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure, false),
+    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
+                     vbasedev.enable_migration, false),
+    DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
+                            vbasedev.pre_copy_dirty_page_tracking,
+                            ON_OFF_AUTO_ON),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index eceaeeccea..23ace82bbb 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -393,6 +393,23 @@ static int caps_parse(QDict *qdict, struct cap_entry caps[], Error **errp)
     return 0;
 }
 
+static int check_pgsize(QObject *qobj, Error **errp)
+{
+    QNum *qn = qobject_to(QNum, qobj);
+    uint64_t pgsize;
+
+    if (qn == NULL || !qnum_get_try_uint(qn, &pgsize)) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_PGSIZE);
+        return -1;
+    }
+    return pgsize == 4096 ? 0 : -1;
+}
+
+static struct cap_entry caps_migr[] = {
+    { VFIO_USER_CAP_PGSIZE, check_pgsize },
+    { NULL }
+};
+
 static int check_max_fds(QObject *qobj, Error **errp)
 {
     QNum *qn = qobject_to(QNum, qobj);
@@ -417,9 +434,21 @@ static int check_max_xfer(QObject *qobj, Error **errp)
     return 0;
 }
 
+static int check_migr(QObject *qobj, Error **errp)
+{
+    QDict *qdict = qobject_to(QDict, qobj);
+
+    if (qdict == NULL || caps_parse(qdict, caps_migr, errp)) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_MIGR);
+        return -1;
+    }
+    return 0;
+}
+
 static struct cap_entry caps_cap[] = {
     { VFIO_USER_CAP_MAX_FDS, check_max_fds },
     { VFIO_USER_CAP_MAX_XFER, check_max_xfer },
+    { VFIO_USER_CAP_MIGR, check_migr },
     { NULL }
 };
 
@@ -466,8 +495,12 @@ static GString *caps_json(void)
 {
     QDict *dict = qdict_new();
     QDict *capdict = qdict_new();
+    QDict *migdict = qdict_new();
     GString *str;
 
+    qdict_put_int(migdict, VFIO_USER_CAP_PGSIZE, 4096);
+    qdict_put_obj(capdict, VFIO_USER_CAP_MIGR, QOBJECT(migdict));
+
     qdict_put_int(capdict, VFIO_USER_CAP_MAX_FDS, VFIO_USER_MAX_MAX_FDS);
     qdict_put_int(capdict, VFIO_USER_CAP_MAX_XFER, VFIO_USER_DEF_MAX_XFER);
 
-- 
2.25.1
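
Note on the capability exchange above: caps_json() now advertises a nested
"migration" object with a "pgsize" member, and check_pgsize() accepts only
4096 from the peer. The sketch below formats a document of that shape with
plain snprintf() so the nesting is visible. It is an illustration only: the
outer "capabilities" key is inferred from the header comment, the member
order in real output may differ, and the max_data_xfer_size value in the
usage note is an example, since the value of VFIO_USER_DEF_MAX_XFER is not
shown in this patch. Both function names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the rule in check_pgsize(): only 4096 is accepted. */
static int pgsize_acceptable(uint64_t pgsize)
{
    return pgsize == 4096 ? 0 : -1;
}

/* Write the capability document into buf; returns 0, or -1 on truncation. */
static int format_caps(char *buf, size_t len, uint64_t max_fds,
                       uint64_t max_xfer, uint64_t pgsize)
{
    assert(buf != NULL && len > 0);

    int n = snprintf(buf, len,
                     "{\"capabilities\": {\"max_msg_fds\": %" PRIu64
                     ", \"max_data_xfer_size\": %" PRIu64
                     ", \"migration\": {\"pgsize\": %" PRIu64 "}}}",
                     max_fds, max_xfer, pgsize);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling format_caps(buf, sizeof(buf), 16, 1048576, 4096) on a 256-byte buffer
fills it with a single JSON object whose "migration" member is itself an
object, which is why check_migr() converts its argument to a QDict before
parsing it with caps_migr.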




* [PATCH RFC server 00/11] vfio-user server in QEMU
  2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
                   ` (18 preceding siblings ...)
  2021-07-19  6:27 ` [PATCH RFC 19/19] vfio-user: add migration cli options and version negotiation Elena Ufimtseva
@ 2021-07-19 20:00 ` Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 01/11] vfio-user: build library Jagannathan Raman
                     ` (10 more replies)
  19 siblings, 11 replies; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Hi,

This series adds on to the following series from
Elena Ufimtseva <elena.ufimtseva@oracle.com>:
[PATCH RFC 00/19] vfio-user implementation

QEMU's multi-process feature [1] enabled out-of-process device emulation.
It used a custom protocol between the client and server, which is not
desirable.

The vfio-user protocol [2] implements a VFIO-based mechanism for the client
and server to interact. Since VFIO is a well-established specification, it is
preferable in terms of maintenance. It makes sense for multi-process to
switch to the vfio-user protocol.

Nutanix implemented the vfio-user protocol in their libvfio-user library. The
source for this library is located below:
https://github.com/nutanix/libvfio-user

Elena previously sent the patches for the vfio-user client.

This series implements a vfio-user server for QEMU. It includes
libvfio-user as a git submodule of QEMU and builds it along with QEMU.

We would like to make the following notes:
  - Some of the existing multi-process code will become obsolete and will
    need to be removed. This series does not remove it, to keep the number
    of patches to a minimum. We will address this subsequently.
  - The libvfio-user library needs the json-c package to build. The GitLab CI
    images used for build testing do not appear to have this package, so the
    build fails there.

The patches from both series are available in the following github repo:
https://github.com/oracle/qemu.git
The vfio-user-client-server branch provides the same patches along with a
python script (scripts/vfiouser-launcher.py) to launch the VM.

Contributors:
John G Johnson <john.g.johnson@oracle.com>
John Levon <john.levon@nutanix.com>
Thanos Makatos <thanos.makatos@nutanix.com>
Elena Ufimtseva <elena.ufimtseva@oracle.com>
Jagannathan Raman <jag.raman@oracle.com>

We are looking forward to your comments and questions.

Thank you!

[1]: https://patchew.org/QEMU/20210210092628.193785-1-stefanha@redhat.com/
[2]: https://patchwork.kernel.org/project/qemu-devel/patch/20210614104608.212276-1-thanos.makatos@nutanix.com/

Jagannathan Raman (11):
  vfio-user: build library
  vfio-user: define vfio-user object
  vfio-user: instantiate vfio-user context
  vfio-user: find and init PCI device
  vfio-user: run vfio-user context
  vfio-user: handle PCI config space accesses
  vfio-user: handle DMA mappings
  vfio-user: handle PCI BAR accesses
  vfio-user: handle device interrupts
  vfio-user: register handlers to facilitate migration
  vfio-user: acceptance test

 configure                     |  11 +
 meson.build                   |  35 ++
 qapi/qom.json                 |  20 +-
 include/hw/remote/iohub.h     |   2 +
 migration/savevm.h            |   2 +
 hw/remote/iohub.c             |   6 +
 hw/remote/vfio-user-obj.c     | 754 ++++++++++++++++++++++++++++++++++++++++++
 migration/savevm.c            |  63 ++++
 .gitmodules                   |   3 +
 MAINTAINERS                   |   9 +
 hw/remote/meson.build         |   3 +
 hw/remote/trace-events        |  10 +
 libvfio-user                  |   1 +
 tests/acceptance/vfio-user.py |  94 ++++++
 14 files changed, 1011 insertions(+), 2 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c
 create mode 160000 libvfio-user
 create mode 100644 tests/acceptance/vfio-user.py

-- 
1.8.3.1




* [PATCH RFC server 01/11] vfio-user: build library
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-19 20:24     ` John Levon
  2021-07-19 20:00   ` [PATCH RFC server 02/11] vfio-user: define vfio-user object Jagannathan Raman
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Add the libvfio-user library as a submodule and build it as part of QEMU.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 configure             | 11 +++++++++++
 meson.build           | 35 +++++++++++++++++++++++++++++++++++
 .gitmodules           |  3 +++
 MAINTAINERS           |  7 +++++++
 hw/remote/meson.build |  2 ++
 libvfio-user          |  1 +
 6 files changed, 59 insertions(+)
 create mode 160000 libvfio-user

diff --git a/configure b/configure
index 49b5481..bc1c961 100755
--- a/configure
+++ b/configure
@@ -4297,6 +4297,17 @@ but not implemented on your system"
 fi
 
 ##########################################
+# check for multiprocess
+
+case "$multiprocess" in
+  auto | enabled )
+    if test "$git_submodules_action" != "ignore"; then
+      git_submodules="${git_submodules} libvfio-user"
+    fi
+    ;;
+esac
+
+##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
 
diff --git a/meson.build b/meson.build
index 6e4d2d8..f2f9f86 100644
--- a/meson.build
+++ b/meson.build
@@ -1894,6 +1894,41 @@ if get_option('cfi') and slirp_opt == 'system'
          + ' Please configure with --enable-slirp=git')
 endif
 
+vfiouser = not_found
+if have_system and multiprocess_allowed
+  have_internal = fs.exists(meson.current_source_dir() / 'libvfio-user/Makefile')
+
+  if not have_internal
+    error('libvfio-user source not found - please pull git submodule')
+  endif
+
+  vfiouser_files = [
+    'libvfio-user/lib/dma.c',
+    'libvfio-user/lib/irq.c',
+    'libvfio-user/lib/libvfio-user.c',
+    'libvfio-user/lib/migration.c',
+    'libvfio-user/lib/pci.c',
+    'libvfio-user/lib/pci_caps.c',
+    'libvfio-user/lib/tran_sock.c',
+  ]
+
+  vfiouser_inc = include_directories('libvfio-user/include', 'libvfio-user/lib')
+
+  json_c = dependency('json-c', required: false)
+  if not json_c.found()
+    json_c = dependency('libjson-c')
+  endif
+
+  libvfiouser = static_library('vfiouser',
+                               build_by_default: false,
+                               sources: vfiouser_files,
+                               dependencies: json_c,
+                               include_directories: vfiouser_inc)
+
+  vfiouser = declare_dependency(link_with: libvfiouser,
+                                include_directories: vfiouser_inc)
+endif
+
 fdt = not_found
 fdt_opt = get_option('fdt')
 if have_system
diff --git a/.gitmodules b/.gitmodules
index 08b1b48..a583a39 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -64,3 +64,6 @@
 [submodule "roms/vbootrom"]
 	path = roms/vbootrom
 	url = https://gitlab.com/qemu-project/vbootrom.git
+[submodule "libvfio-user"]
+	path = libvfio-user
+	url = https://github.com/nutanix/libvfio-user.git
diff --git a/MAINTAINERS b/MAINTAINERS
index aa4df6c..99646e7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3350,6 +3350,13 @@ F: semihosting/
 F: include/semihosting/
 F: tests/tcg/multiarch/arm-compat-semi/
 
+libvfio-user Library
+M: Thanos Makatos <thanos.makatos@nutanix.com>
+M: John Levon <john.levon@nutanix.com>
+T: https://github.com/nutanix/libvfio-user.git
+S: Maintained
+F: libvfio-user/*
+
 Multi-process QEMU
 M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
 M: Jagannathan Raman <jag.raman@oracle.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index e6a5574..fb35fb8 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -7,6 +7,8 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
 
+remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: vfiouser)
+
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('memory.c'))
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy-memory-listener.c'))
 
diff --git a/libvfio-user b/libvfio-user
new file mode 160000
index 0000000..2a0a929
--- /dev/null
+++ b/libvfio-user
@@ -0,0 +1 @@
+Subproject commit 2a0a92912d598de871ab47c034432c5fa6546dc4
-- 
1.8.3.1




* [PATCH RFC server 02/11] vfio-user: define vfio-user object
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 01/11] vfio-user: build library Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Define the vfio-user object, which is the remote process server for QEMU.
Set up the object initialization functions and the properties necessary to
instantiate the object.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 qapi/qom.json             |  20 ++++++-
 hw/remote/vfio-user-obj.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS               |   1 +
 hw/remote/meson.build     |   1 +
 hw/remote/trace-events    |   3 +
 5 files changed, 164 insertions(+), 2 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c

diff --git a/qapi/qom.json b/qapi/qom.json
index 652be31..e0716d2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -684,6 +684,20 @@
   'data': { 'fd': 'str', 'devid': 'str' } }
 
 ##
+# @VfioUserProperties:
+#
+# Properties for vfio-user objects.
+#
+# @socket: path to be used as the socket by the libvfio-user library
+#
+# @devid: the id of the PCI device on the server to be associated with the socket
+#
+# Since: 6.0
+##
+{ 'struct': 'VfioUserProperties',
+  'data': { 'socket': 'str', 'devid': 'str' } }
+
+##
 # @RngProperties:
 #
 # Properties for objects of classes derived from rng.
@@ -807,7 +821,8 @@
     'tls-creds-psk',
     'tls-creds-x509',
     'tls-cipher-suites',
-    'x-remote-object'
+    'x-remote-object',
+    'vfio-user'
   ] }
 
 ##
@@ -863,7 +878,8 @@
       'tls-creds-psk':              'TlsCredsPskProperties',
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
-      'x-remote-object':            'RemoteObjectProperties'
+      'x-remote-object':            'RemoteObjectProperties',
+      'vfio-user':                  'VfioUserProperties'
   } }
 
 ##
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
new file mode 100644
index 0000000..5098169
--- /dev/null
+++ b/hw/remote/vfio-user-obj.c
@@ -0,0 +1,141 @@
+/*
+ * QEMU vfio-user server object
+ *
+ * Copyright © 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
+ *
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/**
+ * Usage: add options:
+ *     -machine x-remote
+ *     -device <PCI-device>,id=<pci-dev-id>
+ *     -object vfio-user,id=<id>,socket=<socket-path>,devid=<pci-dev-id>
+ *
+ * Note that the vfio-user object must be used with the x-remote machine
+ * only. This server supports only PCI devices for now.
+ *
+ * socket is the path to a file. This file will be created by the server.
+ * It is a required option.
+ *
+ * devid is the id of a PCI device on the server. It is also a required option.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qom/object.h"
+#include "qom/object_interfaces.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "sysemu/runstate.h"
+
+#define TYPE_VFU_OBJECT "vfio-user"
+OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
+
+struct VfuObjectClass {
+    ObjectClass parent_class;
+
+    unsigned int nr_devs;
+
+    /* Maximum number of devices the server can support */
+    unsigned int max_devs;
+};
+
+struct VfuObject {
+    /* private */
+    Object parent;
+
+    char *socket;
+    char *devid;
+};
+
+static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    g_free(o->socket);
+
+    o->socket = g_strdup(str);
+
+    trace_vfu_prop("socket", str);
+}
+
+static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    g_free(o->devid);
+
+    o->devid = g_strdup(str);
+
+    trace_vfu_prop("devid", str);
+}
+
+static void vfu_object_init(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+
+    /* Add test for remote machine and PCI device */
+
+    if (k->nr_devs >= k->max_devs) {
+        error_report("Reached maximum number of vfio-user devices: %u",
+                     k->max_devs);
+        return;
+    }
+
+    k->nr_devs++;
+}
+
+static void vfu_object_finalize(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
+
+    k->nr_devs--;
+
+    g_free(o->socket);
+    g_free(o->devid);
+
+    if (k->nr_devs == 0) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+}
+
+static void vfu_object_class_init(ObjectClass *klass, void *data)
+{
+    VfuObjectClass *k = VFU_OBJECT_CLASS(klass);
+
+    /* Limiting maximum number of devices to 1 until IOMMU support is added */
+    k->max_devs = 1;
+    k->nr_devs = 0;
+
+    object_class_property_add_str(klass, "socket", NULL,
+                                  vfu_object_set_socket);
+    object_class_property_add_str(klass, "devid", NULL,
+                                  vfu_object_set_devid);
+}
+
+static const TypeInfo vfu_object_info = {
+    .name = TYPE_VFU_OBJECT,
+    .parent = TYPE_OBJECT,
+    .instance_size = sizeof(VfuObject),
+    .instance_init = vfu_object_init,
+    .instance_finalize = vfu_object_finalize,
+    .class_size = sizeof(VfuObjectClass),
+    .class_init = vfu_object_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_USER_CREATABLE },
+        { }
+    }
+};
+
+static void vfu_register_types(void)
+{
+    type_register_static(&vfu_object_info);
+}
+
+type_init(vfu_register_types);
diff --git a/MAINTAINERS b/MAINTAINERS
index 99646e7..46ab6b6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3380,6 +3380,7 @@ F: hw/remote/proxy-memory-listener.c
 F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
+F: hw/remote/vfio-user-obj.c
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index fb35fb8..cd44dfc 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
+remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('vfio-user-obj.c'))
 
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: vfiouser)
 
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0b23974..7da12f0 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -2,3 +2,6 @@
 
 mpqemu_send_io_error(int cmd, int size, int nfds) "send command %d size %d, %d file descriptors to remote process"
 mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d, %d file descriptors to remote process"
+
+# vfio-user-obj.c
+vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
-- 
1.8.3.1




* [PATCH RFC server 03/11] vfio-user: instantiate vfio-user context
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 01/11] vfio-user: build library Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 02/11] vfio-user: define vfio-user object Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 04/11] vfio-user: find and init PCI device Jagannathan Raman
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Create a context with the vfio-user library for a device.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 5098169..adb3193 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -27,11 +27,18 @@
 #include "qemu/osdep.h"
 #include "qemu-common.h"
 
+#include <errno.h>
+
 #include "qom/object.h"
 #include "qom/object_interfaces.h"
 #include "qemu/error-report.h"
 #include "trace.h"
 #include "sysemu/runstate.h"
+#include "qemu/notify.h"
+#include "qapi/error.h"
+#include "sysemu/sysemu.h"
+
+#include "libvfio-user/include/libvfio-user.h"
 
 #define TYPE_VFU_OBJECT "vfio-user"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -51,6 +58,10 @@ struct VfuObject {
 
     char *socket;
     char *devid;
+
+    Notifier machine_done;
+
+    vfu_ctx_t *vfu_ctx;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -75,9 +86,23 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
     trace_vfu_prop("devid", str);
 }
 
+static void vfu_object_machine_done(Notifier *notifier, void *data)
+{
+    VfuObject *o = container_of(notifier, VfuObject, machine_done);
+
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
+                                o, VFU_DEV_TYPE_PCI);
+    if (o->vfu_ctx == NULL) {
+        error_setg(&error_abort, "vfu: Failed to create context - %s",
+                   strerror(errno));
+        return;
+    }
+}
+
 static void vfu_object_init(Object *obj)
 {
     VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
 
     /* Add test for remote machine and PCI device */
 
@@ -88,6 +113,9 @@ static void vfu_object_init(Object *obj)
     }
 
     k->nr_devs++;
+
+    o->machine_done.notify = vfu_object_machine_done;
+    qemu_add_machine_init_done_notifier(&o->machine_done);
 }
 
 static void vfu_object_finalize(Object *obj)
@@ -97,6 +125,8 @@ static void vfu_object_finalize(Object *obj)
 
     k->nr_devs--;
 
+    vfu_destroy_ctx(o->vfu_ctx);
+
     g_free(o->socket);
     g_free(o->devid);
 
-- 
1.8.3.1




* [PATCH RFC server 04/11] vfio-user: find and init PCI device
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (2 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-26 15:05     ` John Levon
  2021-07-19 20:00   ` [PATCH RFC server 05/11] vfio-user: run vfio-user context Jagannathan Raman
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Find the PCI device with the specified id. Initialize the device context
with the QEMU PCI device.
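
For illustration, the ID lookup described above amounts to reading four
little-endian 16-bit fields from the device's configuration space. The
sketch below is standalone and uses neither QEMU nor libvfio-user: the
struct pci_ids type and both helper names are hypothetical, and only the
register offsets are taken from the PCI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI configuration-space offsets of the ID registers. */
enum {
    CFG_VENDOR_ID           = 0x00,
    CFG_DEVICE_ID           = 0x02,
    CFG_SUBSYSTEM_VENDOR_ID = 0x2c,
    CFG_SUBSYSTEM_ID        = 0x2e,
};

/* Hypothetical container for the four IDs passed on to the library. */
struct pci_ids {
    uint16_t vendor;
    uint16_t device;
    uint16_t subsys_vendor;
    uint16_t subsys;
};

/* Read a little-endian 16-bit register from a config-space image. */
static uint16_t cfg_get_word(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Collect the IDs that identify the device to the client. */
static struct pci_ids pci_ids_from_config(const uint8_t *cfg)
{
    assert(cfg != NULL);

    struct pci_ids ids = {
        .vendor        = cfg_get_word(cfg, CFG_VENDOR_ID),
        .device        = cfg_get_word(cfg, CFG_DEVICE_ID),
        .subsys_vendor = cfg_get_word(cfg, CFG_SUBSYSTEM_VENDOR_ID),
        .subsys        = cfg_get_word(cfg, CFG_SUBSYSTEM_ID),
    };
    return ids;
}
```

Reading the IDs from the emulated device's own config space, rather than
hard-coding them, lets the same server object front any PCI device model.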

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index adb3193..e362709 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -37,6 +37,8 @@
 #include "qemu/notify.h"
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
+#include "hw/qdev-core.h"
+#include "hw/pci/pci.h"
 
 #include "libvfio-user/include/libvfio-user.h"
 
@@ -62,6 +64,8 @@ struct VfuObject {
     Notifier machine_done;
 
     vfu_ctx_t *vfu_ctx;
+
+    PCIDevice *pci_dev;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -89,6 +93,8 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
+    DeviceState *dev = NULL;
+    int ret;
 
     o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
                                 o, VFU_DEV_TYPE_PCI);
@@ -97,6 +103,28 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
                    strerror(errno));
         return;
     }
+
+    dev = qdev_find_recursive(sysbus_get_default(), o->devid);
+    if (dev == NULL) {
+        error_setg(&error_abort, "vfu: Device %s not found", o->devid);
+        return;
+    }
+    o->pci_dev = PCI_DEVICE(dev);
+
+    ret = vfu_pci_init(o->vfu_ctx, VFU_PCI_TYPE_CONVENTIONAL,
+                       PCI_HEADER_TYPE_NORMAL, 0);
+    if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to attach PCI device %s to context - %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
+    vfu_pci_set_id(o->vfu_ctx,
+                   pci_get_word(o->pci_dev->config + PCI_VENDOR_ID),
+                   pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
+                   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID),
+                   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));
 }
 
 static void vfu_object_init(Object *obj)
-- 
1.8.3.1




* [PATCH RFC server 05/11] vfio-user: run vfio-user context
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (3 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 04/11] vfio-user: find and init PCI device Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-20 14:17     ` Thanos Makatos
  2021-07-19 20:00   ` [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Set up a separate thread to run the vfio-user context. The thread acts as
the main loop for the device.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index e362709..6a2d0f5 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -35,6 +35,7 @@
 #include "trace.h"
 #include "sysemu/runstate.h"
 #include "qemu/notify.h"
+#include "qemu/thread.h"
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "hw/qdev-core.h"
@@ -66,6 +67,8 @@ struct VfuObject {
     vfu_ctx_t *vfu_ctx;
 
     PCIDevice *pci_dev;
+
+    QemuThread vfu_ctx_thread;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -90,6 +93,44 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
     trace_vfu_prop("devid", str);
 }
 
+static void *vfu_object_ctx_run(void *opaque)
+{
+    VfuObject *o = opaque;
+    int ret;
+
+    ret = vfu_realize_ctx(o->vfu_ctx);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to realize device %s - %s",
+                   o->devid, strerror(errno));
+        return NULL;
+    }
+
+    ret = vfu_attach_ctx(o->vfu_ctx);
+    if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to attach device %s to context - %s",
+                   o->devid, strerror(errno));
+        return NULL;
+    }
+
+    do {
+        ret = vfu_run_ctx(o->vfu_ctx);
+        if (ret < 0) {
+            if (errno == EINTR) {
+                ret = 0;
+            } else if (errno == ENOTCONN) {
+                object_unparent(OBJECT(o));
+                break;
+            } else {
+                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
+                           o->devid, strerror(errno));
+            }
+        }
+    } while (ret == 0);
+
+    return NULL;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -125,6 +166,9 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
                    pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
                    pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID),
                    pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));
+
+    qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
+                       o, QEMU_THREAD_JOINABLE);
 }
 
 static void vfu_object_init(Object *obj)
-- 
1.8.3.1




* [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (4 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 05/11] vfio-user: run vfio-user context Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-26 15:10     ` John Levon
  2021-07-19 20:00   ` [PATCH RFC server 07/11] vfio-user: handle DMA mappings Jagannathan Raman
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Define and register handlers for PCI config space accesses.
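
A minimal sketch of what such a handler does: the access arrives as a
buffer, a byte count, an offset and a direction flag, and is applied to
the configuration space one byte at a time. This standalone version
assumes a plain 256-byte array (struct fake_pci_dev is hypothetical) in
place of QEMU's config accessors and locking, and it adds a bounds check
that is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#define CFG_SPACE_SIZE 256

/* Hypothetical device model: just a conventional PCI config space. */
struct fake_pci_dev {
    uint8_t config[CFG_SPACE_SIZE];
};

/*
 * Apply a config-space read or write byte by byte.
 * Returns the number of bytes transferred, or -1 if the range is invalid.
 */
static ssize_t cfg_access(struct fake_pci_dev *dev, char *buf, size_t count,
                          off_t offset, bool is_write)
{
    assert(dev != NULL && buf != NULL);

    /* Reject accesses that fall outside the config space. */
    if (offset < 0 || (size_t)offset > CFG_SPACE_SIZE ||
        count > CFG_SPACE_SIZE - (size_t)offset) {
        return -1;
    }

    for (size_t i = 0; i < count; i++) {
        uint8_t *reg = &dev->config[(size_t)offset + i];

        if (is_write) {
            *reg = (uint8_t)buf[i];
        } else {
            buf[i] = (char)*reg;
        }
    }

    return (ssize_t)count;
}
```

Splitting the transfer into single bytes keeps the handler independent
of the access width chosen by the client, at the cost of one accessor
call per byte.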

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 41 +++++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 43 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 6a2d0f5..60d9fa8 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -36,6 +36,7 @@
 #include "sysemu/runstate.h"
 #include "qemu/notify.h"
 #include "qemu/thread.h"
+#include "qemu/main-loop.h"
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "hw/qdev-core.h"
@@ -131,6 +132,35 @@ static void *vfu_object_ctx_run(void *opaque)
     return NULL;
 }
 
+static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
+                                     size_t count, loff_t offset,
+                                     const bool is_write)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint32_t val = 0;
+    int i;
+
+    qemu_mutex_lock_iothread();
+
+    for (i = 0; i < count; i++) {
+        if (is_write) {
+            val = *((uint8_t *)(buf + i));
+            trace_vfu_cfg_write((offset + i), val);
+            pci_default_write_config(PCI_DEVICE(o->pci_dev),
+                                     (offset + i), val, 1);
+        } else {
+            val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
+                                          (offset + i), 1);
+            *((uint8_t *)(buf + i)) = (uint8_t)val;
+            trace_vfu_cfg_read((offset + i), val);
+        }
+    }
+
+    qemu_mutex_unlock_iothread();
+
+    return count;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -167,6 +197,17 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
                    pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID),
                    pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));
 
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
+                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
+                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
+                           NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to set up config space handlers for %s - %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
     qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
                        o, QEMU_THREAD_JOINABLE);
 }
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 7da12f0..2ef7884 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -5,3 +5,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 
 # vfio-user-obj.c
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
+vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
+vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
-- 
1.8.3.1




* [PATCH RFC server 07/11] vfio-user: handle DMA mappings
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (5 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-20 14:38     ` Thanos Makatos
  2021-07-19 20:00   ` [PATCH RFC server 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Define and register callbacks to manage the RAM regions used for
device DMA.
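
Conceptually, the two callbacks maintain a mapping from guest physical
ranges to the host addresses where the client's memory is mapped. The
sketch below models that bookkeeping with a small fixed-size table. It
is standalone and deliberately avoids QEMU's MemoryRegion API; every
type and function name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DMA_REGIONS 8

/* One registered range: guest physical base, length, host mapping. */
struct dma_region {
    uint64_t gpa;
    size_t len;
    void *vaddr;
};

struct dma_table {
    struct dma_region regions[MAX_DMA_REGIONS];
    size_t count;
};

/* Returns 1 if tracked, 0 if skipped (not mapped), -1 if the table is full. */
static int dma_table_register(struct dma_table *t, uint64_t gpa, size_t len,
                              void *vaddr)
{
    assert(t != NULL);

    if (vaddr == NULL || len == 0) {
        return 0;   /* nothing mapped into this process, nothing to track */
    }
    if (t->count == MAX_DMA_REGIONS) {
        return -1;
    }
    t->regions[t->count].gpa = gpa;
    t->regions[t->count].len = len;
    t->regions[t->count].vaddr = vaddr;
    t->count++;
    return 1;
}

/* Look the region up by host address, as the unregister path does. */
static int dma_table_unregister(struct dma_table *t, void *vaddr)
{
    assert(t != NULL);

    for (size_t i = 0; i < t->count; i++) {
        if (t->regions[i].vaddr == vaddr) {
            t->count--;
            t->regions[i] = t->regions[t->count]; /* fill the hole */
            return 1;
        }
    }
    return 0;
}

/* Translate a guest physical address to a host pointer, or NULL. */
static void *dma_table_translate(const struct dma_table *t, uint64_t gpa)
{
    assert(t != NULL);

    for (size_t i = 0; i < t->count; i++) {
        const struct dma_region *r = &t->regions[i];

        if (gpa >= r->gpa && gpa - r->gpa < r->len) {
            return (uint8_t *)r->vaddr + (gpa - r->gpa);
        }
    }
    return NULL;
}
```

In the patch the same role is played by RAM subregions added to the
system memory region, so device models can reach guest memory through
the ordinary address-space accessors.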

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 60d9fa8..d158a7f 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -161,6 +161,57 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     return count;
 }
 
+static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    MemoryRegion *subregion = NULL;
+    g_autofree char *name = NULL;
+    static unsigned int suffix;
+    struct iovec *iov = &info->iova;
+
+    if (!info->vaddr) {
+        return;
+    }
+
+    name = g_strdup_printf("remote-mem-%u", suffix++);
+
+    subregion = g_new0(MemoryRegion, 1);
+
+    qemu_mutex_lock_iothread();
+
+    memory_region_init_ram_ptr(subregion, NULL, name,
+                               iov->iov_len, info->vaddr);
+
+    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
+                                subregion);
+
+    qemu_mutex_unlock_iothread();
+
+    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
+}
+
+static int dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    MemoryRegion *mr = NULL;
+    ram_addr_t offset;
+
+    mr = memory_region_from_host(info->vaddr, &offset);
+    if (!mr) {
+        return 0;
+    }
+
+    qemu_mutex_lock_iothread();
+
+    memory_region_del_subregion(get_system_memory(), mr);
+
+    object_unparent((OBJECT(mr)));
+
+    qemu_mutex_unlock_iothread();
+
+    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
+
+    return 0;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -208,6 +259,13 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    ret = vfu_setup_device_dma(o->vfu_ctx, &dma_register, &dma_unregister);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to set up DMA handlers for %s",
+                   o->devid);
+        return;
+    }
+
     qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
                        o, QEMU_THREAD_JOINABLE);
 }
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 2ef7884..f945c7e 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -7,3 +7,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
 vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
+vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
+vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
-- 
1.8.3.1




* [PATCH RFC server 08/11] vfio-user: handle PCI BAR accesses
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (6 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 07/11] vfio-user: handle DMA mappings Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 09/11] vfio-user: handle device interrupts Jagannathan Raman
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Determine the BARs used by the PCI device and register handlers to
manage accesses to them.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 +
 2 files changed, 97 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index d158a7f..9853feb 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -212,6 +212,99 @@ static int dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     return 0;
 }
 
+static ssize_t vfu_object_bar_rw(PCIDevice *pci_dev, hwaddr addr, size_t count,
+                                 char * const buf, const bool is_write,
+                                 uint8_t type)
+{
+    AddressSpace *as = NULL;
+    MemTxResult res;
+
+    if (type == PCI_BASE_ADDRESS_SPACE_MEMORY) {
+        as = pci_device_iommu_address_space(pci_dev);
+    } else {
+        as = &address_space_io;
+    }
+
+    trace_vfu_bar_rw_enter(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    res = address_space_rw(as, addr, MEMTXATTRS_UNSPECIFIED, (void *)buf,
+                           (hwaddr)count, is_write);
+    if (res != MEMTX_OK) {
+        warn_report("vfu: failed to %s 0x%"PRIx64"",
+                    is_write ? "write to" : "read from",
+                    addr);
+        return -1;
+    }
+
+    trace_vfu_bar_rw_exit(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    return count;
+}
+
+/**
+ * VFU_OBJECT_BAR_HANDLER - macro for defining handlers for PCI BARs.
+ *
+ * To create the handler for BAR number 2, VFU_OBJECT_BAR_HANDLER(2) would
+ * define vfu_object_bar2_handler.
+ */
+#define VFU_OBJECT_BAR_HANDLER(BAR_NO)                                         \
+    static ssize_t vfu_object_bar##BAR_NO##_handler(vfu_ctx_t *vfu_ctx,        \
+                                        char * const buf, size_t count,        \
+                                        loff_t offset, const bool is_write)    \
+    {                                                                          \
+        VfuObject *o = vfu_get_private(vfu_ctx);                               \
+        hwaddr addr = (hwaddr)(pci_get_long(o->pci_dev->config +               \
+                                            PCI_BASE_ADDRESS_0 +               \
+                                            (4 * BAR_NO)) + offset);           \
+                                                                               \
+        return vfu_object_bar_rw(o->pci_dev, addr, count, buf, is_write,       \
+                                 o->pci_dev->io_regions[BAR_NO].type);         \
+    }                                                                          \
+
+VFU_OBJECT_BAR_HANDLER(0)
+VFU_OBJECT_BAR_HANDLER(1)
+VFU_OBJECT_BAR_HANDLER(2)
+VFU_OBJECT_BAR_HANDLER(3)
+VFU_OBJECT_BAR_HANDLER(4)
+VFU_OBJECT_BAR_HANDLER(5)
+
+static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
+    &vfu_object_bar0_handler,
+    &vfu_object_bar1_handler,
+    &vfu_object_bar2_handler,
+    &vfu_object_bar3_handler,
+    &vfu_object_bar4_handler,
+    &vfu_object_bar5_handler,
+};
+
+/**
+ * vfu_object_register_bars - Identify active BAR regions of pdev and set up
+ *                            callbacks to handle read/write accesses
+ */
+static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
+{
+    uint32_t orig_val, new_val;
+    int i, size;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        orig_val = pci_default_read_config(pdev,
+                                           PCI_BASE_ADDRESS_0 + (4 * i), 4);
+        new_val = 0xffffffff;
+        pci_default_write_config(pdev,
+                                 PCI_BASE_ADDRESS_0 + (4 * i), new_val, 4);
+        new_val = pci_default_read_config(pdev,
+                                          PCI_BASE_ADDRESS_0 + (4 * i), 4);
+        size = (~(new_val & 0xFFFFFFF0)) + 1;
+        pci_default_write_config(pdev, PCI_BASE_ADDRESS_0 + (4 * i),
+                                 orig_val, 4);
+        if (size) {
+            vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR0_REGION_IDX + i, size,
+                             vfu_object_bar_handlers[i], VFU_REGION_FLAG_RW,
+                             NULL, 0, -1, 0);
+        }
+    }
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -266,6 +359,8 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
+
     qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
                        o, QEMU_THREAD_JOINABLE);
 }
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index f945c7e..f3f65e2 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,3 +9,5 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
+vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
-- 
1.8.3.1
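
The BAR probing and address computation in this patch can be sketched in
isolation. This is a minimal sketch, not the patch's code: bar_base(),
bar_size() and bar_access_addr() are hypothetical helper names, and only
32-bit BARs are handled (a 64-bit memory BAR spans two registers). It
assumes the standard PCI encoding in which bit 0 marks an I/O BAR and the
low flag bits are not part of the address.

```c
#include <assert.h>
#include <stdint.h>

/* Low bits of a 32-bit BAR register hold flags, not address bits. */
#define BAR_SPACE_IO   0x1u          /* bit 0 set: I/O space BAR */
#define BAR_IO_MASK    0xFFFFFFFCu   /* address bits of an I/O BAR */
#define BAR_MEM_MASK   0xFFFFFFF0u   /* address bits of a memory BAR */

/* Hypothetical helper: strip the flag bits from a BAR register value. */
static uint32_t bar_base(uint32_t reg)
{
    return reg & ((reg & BAR_SPACE_IO) ? BAR_IO_MASK : BAR_MEM_MASK);
}

/*
 * Hypothetical helper: derive the region size from the value read back
 * after writing 0xffffffff to the BAR. Returns 0 for an unimplemented
 * BAR, which reads back as all zeroes.
 */
static uint32_t bar_size(uint32_t probed)
{
    uint32_t addr_bits = bar_base(probed);

    return addr_bits ? (~addr_bits + 1u) : 0;
}

/* Address targeted by an access at 'offset' into a BAR of 'size' bytes. */
static uint64_t bar_access_addr(uint32_t reg, uint32_t size, uint64_t offset)
{
    assert(offset < size);           /* caller must stay inside the region */
    return (uint64_t)bar_base(reg) + offset;
}
```

For example, a memory BAR that reads back 0xfffff000 after the probe is
4 KiB in size, and an I/O BAR that reads back 0xffffff01 is 256 bytes.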




* [PATCH RFC server 09/11] vfio-user: handle device interrupts
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (7 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
  2021-07-19 20:00   ` [PATCH RFC server 11/11] vfio-user: acceptance test Jagannathan Raman
  10 siblings, 0 replies; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Forward remote device's interrupts to the guest

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/remote/iohub.h |  2 ++
 hw/remote/iohub.c         |  6 ++++++
 hw/remote/vfio-user-obj.c | 30 ++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  1 +
 4 files changed, 39 insertions(+)

diff --git a/include/hw/remote/iohub.h b/include/hw/remote/iohub.h
index 0bf98e0..132f496 100644
--- a/include/hw/remote/iohub.h
+++ b/include/hw/remote/iohub.h
@@ -15,6 +15,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/thread-posix.h"
 #include "hw/remote/mpqemu-link.h"
+#include "libvfio-user/include/libvfio-user.h"
 
 #define REMOTE_IOHUB_NB_PIRQS    PCI_DEVFN_MAX
 
@@ -30,6 +31,7 @@ typedef struct RemoteIOHubState {
     unsigned int irq_level[REMOTE_IOHUB_NB_PIRQS];
     ResampleToken token[REMOTE_IOHUB_NB_PIRQS];
     QemuMutex irq_level_lock[REMOTE_IOHUB_NB_PIRQS];
+    vfu_ctx_t *vfu_ctx[REMOTE_IOHUB_NB_PIRQS];
 } RemoteIOHubState;
 
 int remote_iohub_map_irq(PCIDevice *pci_dev, int intx);
diff --git a/hw/remote/iohub.c b/hw/remote/iohub.c
index 547d597..241c8d7 100644
--- a/hw/remote/iohub.c
+++ b/hw/remote/iohub.c
@@ -18,6 +18,8 @@
 #include "hw/remote/machine.h"
 #include "hw/remote/iohub.h"
 #include "qemu/main-loop.h"
+#include "libvfio-user/include/libvfio-user.h"
+#include "trace.h"
 
 void remote_iohub_init(RemoteIOHubState *iohub)
 {
@@ -62,6 +64,10 @@ void remote_iohub_set_irq(void *opaque, int pirq, int level)
     QEMU_LOCK_GUARD(&iohub->irq_level_lock[pirq]);
 
     if (level) {
+        if (iohub->vfu_ctx[pirq]) {
+            trace_vfu_interrupt(pirq);
+            vfu_irq_trigger(iohub->vfu_ctx[pirq], 0);
+        }
         if (++iohub->irq_level[pirq] == 1) {
             event_notifier_set(&iohub->irqfds[pirq]);
         }
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 9853feb..d2a2e51 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -41,6 +41,9 @@
 #include "sysemu/sysemu.h"
 #include "hw/qdev-core.h"
 #include "hw/pci/pci.h"
+#include "hw/boards.h"
+#include "hw/remote/iohub.h"
+#include "hw/remote/machine.h"
 
 #include "libvfio-user/include/libvfio-user.h"
 
@@ -305,6 +308,26 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
     }
 }
 
+static int vfu_object_setup_irqs(vfu_ctx_t *vfu_ctx, PCIDevice *pci_dev)
+{
+    RemoteMachineState *machine = REMOTE_MACHINE(current_machine);
+    RemoteIOHubState *iohub = &machine->iohub;
+    int pirq, intx, ret;
+
+    ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1);
+    if (ret < 0) {
+        return ret;
+    }
+
+    intx = pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
+
+    pirq = remote_iohub_map_irq(pci_dev, intx);
+
+    iohub->vfu_ctx[pirq] = vfu_ctx;
+
+    return 0;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -361,6 +384,13 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
 
     vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
 
+    ret = vfu_object_setup_irqs(o->vfu_ctx, o->pci_dev);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to setup interrupts for %s",
+                   o->devid);
+        return;
+    }
+
     qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
                        o, QEMU_THREAD_JOINABLE);
 }
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index f3f65e2..b419d6f 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -11,3 +11,4 @@ vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %z
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
+vfu_interrupt(int pirq) "vfu: sending interrupt to device - PIRQ %d"
-- 
1.8.3.1
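
The interrupt path added here (map the device's INTx pin to a PIRQ, remember
which context owns that PIRQ, forward each raise) can be sketched on its
own. This is a simplified model under stated assumptions: struct irq_hub,
hub_bind() and hub_set_irq() are hypothetical names, the 'forwarded'
counter stands in for the call to vfu_irq_trigger(), and the handling of a
lowered line is a guess, since that part of remote_iohub_set_irq() is not
shown in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NB_PIRQS 256            /* assumption: one PIRQ slot per devfn */

struct irq_hub {
    unsigned level[NB_PIRQS];   /* how many times each line was raised */
    void *ctx[NB_PIRQS];        /* opaque per-device context, NULL if unset */
    unsigned forwarded;         /* raises forwarded to a bound context */
};

/* Interrupt Pin register: 0 means no INTx, 1..4 mean INTA..INTD. */
static int intx_from_pin(uint8_t pin)
{
    return (pin >= 1 && pin <= 4) ? pin - 1 : -1;
}

/* Record which context receives interrupts raised on 'pirq'. */
static int hub_bind(struct irq_hub *hub, int pirq, void *ctx)
{
    assert(hub != NULL);
    if (pirq < 0 || pirq >= NB_PIRQS) {
        return -1;
    }
    hub->ctx[pirq] = ctx;
    return 0;
}

/* Raise or lower a line; every raise is forwarded to a bound context. */
static void hub_set_irq(struct irq_hub *hub, int pirq, int level)
{
    assert(hub != NULL);
    if (pirq < 0 || pirq >= NB_PIRQS) {
        return;
    }
    if (level) {
        if (hub->ctx[pirq]) {
            hub->forwarded++;   /* stands in for vfu_irq_trigger(ctx, 0) */
        }
        hub->level[pirq]++;
    } else if (hub->level[pirq] > 0) {
        hub->level[pirq]--;
    }
}
```

Note that intx_from_pin() returns -1 for a device that reports no interrupt
pin, so a caller can skip the binding instead of indexing with -1.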




* [PATCH RFC server 10/11] vfio-user: register handlers to facilitate migration
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (8 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 09/11] vfio-user: handle device interrupts Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-20 14:05     ` Thanos Makatos
  2021-07-19 20:00   ` [PATCH RFC server 11/11] vfio-user: acceptance test Jagannathan Raman
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Store and load the device's state using handlers for live migration

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 migration/savevm.h        |   2 +
 hw/remote/vfio-user-obj.c | 287 ++++++++++++++++++++++++++++++++++++++++++++++
 migration/savevm.c        |  63 ++++++++++
 3 files changed, 352 insertions(+)

diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342..71d1733 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -67,5 +67,7 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
+int qemu_remote_savevm(QEMUFile *f);
+int qemu_remote_loadvm(QEMUFile *f);
 
 #endif
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index d2a2e51..5948576 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -44,6 +44,10 @@
 #include "hw/boards.h"
 #include "hw/remote/iohub.h"
 #include "hw/remote/machine.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/global_state.h"
+#include "block/block.h"
 
 #include "libvfio-user/include/libvfio-user.h"
 
@@ -73,6 +77,31 @@ struct VfuObject {
     PCIDevice *pci_dev;
 
     QemuThread vfu_ctx_thread;
+
+    /*
+     * vfu_mig_buf holds the migration data. In the remote process, this
+     * buffer takes the place of the IO channel that links the source
+     * and the destination.
+     *
+     * Whenever the client QEMU process initiates migration, the libvfio-user
+     * library notifies this server. The remote/server QEMU sets up a
+     * QEMUFile object using this buffer as its backend. The remote passes
+     * this object to its migration subsystem, which collects the VMSDs of
+     * all its devices and stores them in this buffer.
+     *
+     * The libvfio-user library subsequently asks the remote for any data that
+     * needs to be moved over to the destination using its
+     * vfu_migration_callbacks_t APIs. The remote hands over this buffer then.
+     *
+     * The reverse of this process happens at the destination.
+     */
+    uint8_t *vfu_mig_buf;
+
+    uint64_t vfu_mig_buf_size;
+
+    uint64_t vfu_mig_buf_pending;
+
+    QEMUFile *vfu_mig_file;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -97,6 +126,226 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
     trace_vfu_prop("devid", str);
 }
 
+/**
+ * Migration helper functions
+ *
+ * vfu_mig_buf_read & vfu_mig_buf_write are used by QEMU's migration
+ * subsystem - qemu_remote_savevm & qemu_remote_loadvm. savevm/loadvm
+ * call these functions via QEMUFileOps to save/load the VMSD of all
+ * the devices into vfu_mig_buf
+ *
+ */
+static ssize_t vfu_mig_buf_read(void *opaque, uint8_t *buf, int64_t pos,
+                                size_t size, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    if (pos > o->vfu_mig_buf_size) {
+        size = 0;
+    } else if ((pos + size) > o->vfu_mig_buf_size) {
+        size = o->vfu_mig_buf_size;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + pos), size);
+
+    o->vfu_mig_buf_size -= size;
+
+    return size;
+}
+
+static ssize_t vfu_mig_buf_write(void *opaque, struct iovec *iov, int iovcnt,
+                                 int64_t pos, Error **errp)
+{
+    VfuObject *o = opaque;
+    uint64_t end = pos + iov_size(iov, iovcnt);
+    int i;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+    }
+
+    for (i = 0; i < iovcnt; i++) {
+        memcpy((o->vfu_mig_buf + o->vfu_mig_buf_size), iov[i].iov_base,
+               iov[i].iov_len);
+        o->vfu_mig_buf_size += iov[i].iov_len;
+        o->vfu_mig_buf_pending += iov[i].iov_len;
+    }
+
+    return iov_size(iov, iovcnt);
+}
+
+static int vfu_mig_buf_shutdown(void *opaque, bool rd, bool wr, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    o->vfu_mig_buf_size = 0;
+
+    g_clear_pointer(&o->vfu_mig_buf, g_free);
+
+    return 0;
+}
+
+static const QEMUFileOps vfu_mig_fops_save = {
+    .writev_buffer  = vfu_mig_buf_write,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+static const QEMUFileOps vfu_mig_fops_load = {
+    .get_buffer     = vfu_mig_buf_read,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+/**
+ * handlers for vfu_migration_callbacks_t
+ *
+ * The libvfio-user library accesses these handlers to drive the migration
+ * at the remote end, and also to transport the data stored in vfu_mig_buf
+ *
+ */
+static void vfu_mig_state_precopy(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    int ret;
+
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_save);
+    }
+
+    global_state_store();
+
+    ret = qemu_remote_savevm(o->vfu_mig_file);
+    if (ret) {
+        qemu_file_shutdown(o->vfu_mig_file);
+        return;
+    }
+
+    qemu_fflush(o->vfu_mig_file);
+
+    bdrv_inactivate_all();
+}
+
+static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    Error *local_err = NULL;
+    int ret;
+
+    ret = qemu_remote_loadvm(o->vfu_mig_file);
+    if (ret) {
+        error_setg(&error_abort, "vfu: failed to restore device state");
+        return;
+    }
+
+    bdrv_invalidate_cache_all(&local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return;
+    }
+
+    vm_start();
+}
+
+static int vfu_mig_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
+{
+    switch (state) {
+    case VFU_MIGR_STATE_RESUME:
+    case VFU_MIGR_STATE_STOP_AND_COPY:
+    case VFU_MIGR_STATE_STOP:
+        break;
+    case VFU_MIGR_STATE_PRE_COPY:
+        vfu_mig_state_precopy(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_RUNNING:
+        if (!runstate_is_running()) {
+            vfu_mig_state_running(vfu_ctx);
+        }
+        break;
+    default:
+        warn_report("vfu: Unknown migration state %d", state);
+    }
+
+    return 0;
+}
+
+static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    return o->vfu_mig_buf_pending;
+}
+
+static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
+                                uint64_t *size)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset) {
+        *offset = 0;
+    }
+
+    if (size) {
+        *size = o->vfu_mig_buf_size;
+    }
+
+    return 0;
+}
+
+static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
+                                 uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset > o->vfu_mig_buf_size) {
+        return -1;
+    }
+
+    if ((offset + size) > o->vfu_mig_buf_size) {
+        warn_report("vfu: buffer overflow - check pending_bytes");
+        size = o->vfu_mig_buf_size - offset;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + offset), size);
+
+    o->vfu_mig_buf_pending -= size;
+
+    return size;
+}
+
+static ssize_t vfu_mig_write_data(vfu_ctx_t *vfu_ctx, void *data,
+                                  uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint64_t end = offset + size;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+        o->vfu_mig_buf_size = end;
+    }
+
+    memcpy((o->vfu_mig_buf + offset), data, size);
+
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load);
+    }
+
+    return size;
+}
+
+static int vfu_mig_data_written(vfu_ctx_t *vfu_ctx, uint64_t count)
+{
+    return 0;
+}
+
+static const vfu_migration_callbacks_t vfu_mig_cbs = {
+    .version = VFU_MIGR_CALLBACKS_VERS,
+    .transition = &vfu_mig_transition,
+    .get_pending_bytes = &vfu_mig_get_pending_bytes,
+    .prepare_data = &vfu_mig_prepare_data,
+    .read_data = &vfu_mig_read_data,
+    .data_written = &vfu_mig_data_written,
+    .write_data = &vfu_mig_write_data,
+};
+
 static void *vfu_object_ctx_run(void *opaque)
 {
     VfuObject *o = opaque;
@@ -332,6 +581,7 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
     DeviceState *dev = NULL;
+    size_t migr_area_size;
     int ret;
 
     o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
@@ -391,6 +641,35 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    /*
+     * TODO: The 0x20000 value used below is temporary. We are working on
+     *     a cleaner fix for this.
+     *
+     *     The libvfio-user library assumes that the remote knows the size of
+     *     the data to be migrated at boot time, but that is not the case with
+     *     VMSDs, as they can contain variable-size buffers. 0x20000 is used
+     *     as a sufficiently large buffer to demonstrate migration, but it
+     *     cannot serve as a permanent solution.
+     *
+     */
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX,
+                           0x20000, NULL,
+                           VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to register migration BAR %s- %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
+    migr_area_size = vfu_get_migr_register_area_size();
+    ret = vfu_setup_device_migration_callbacks(o->vfu_ctx, &vfu_mig_cbs,
+                                               migr_area_size);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to setup migration %s- %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
     qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
                        o, QEMU_THREAD_JOINABLE);
 }
@@ -412,6 +691,14 @@ static void vfu_object_init(Object *obj)
 
     o->machine_done.notify = vfu_object_machine_done;
     qemu_add_machine_init_done_notifier(&o->machine_done);
+
+    o->vfu_mig_file = NULL;
+
+    o->vfu_mig_buf = NULL;
+
+    o->vfu_mig_buf_size = 0;
+
+    o->vfu_mig_buf_pending = 0;
 }
 
 static void vfu_object_finalize(Object *obj)
diff --git a/migration/savevm.c b/migration/savevm.c
index 72848b9..c2279af 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1603,6 +1603,33 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     return ret;
 }
 
+int qemu_remote_savevm(QEMUFile *f)
+{
+    SaveStateEntry *se;
+    int ret;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->vmsd || !vmstate_save_needed(se->vmsd, se->opaque)) {
+            continue;
+        }
+
+        save_section_header(f, se, QEMU_VM_SECTION_FULL);
+
+        ret = vmstate_save(f, se, NULL);
+        if (ret) {
+            qemu_file_set_error(f, ret);
+            return ret;
+        }
+
+        save_section_footer(f, se);
+    }
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+    qemu_fflush(f);
+
+    return 0;
+}
+
 void qemu_savevm_live_state(QEMUFile *f)
 {
     /* save QEMU_VM_SECTION_END section */
@@ -2443,6 +2470,42 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
     return 0;
 }
 
+int qemu_remote_loadvm(QEMUFile *f)
+{
+    uint8_t section_type;
+    int ret = 0;
+
+    qemu_mutex_lock_iothread();
+
+    while (true) {
+        section_type = qemu_get_byte(f);
+
+        if (qemu_file_get_error(f)) {
+            ret = qemu_file_get_error(f);
+            break;
+        }
+
+        switch (section_type) {
+        case QEMU_VM_SECTION_FULL:
+            ret = qemu_loadvm_section_start_full(f, NULL);
+            if (ret < 0) {
+                goto out;
+            }
+            break;
+        case QEMU_VM_EOF:
+            goto out;
+        default:
+            ret = -EINVAL;
+            goto out;
+        }
+    }
+
+out:
+    qemu_mutex_unlock_iothread();
+
+    return ret;
+}
+
 static int
 qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
 {
-- 
1.8.3.1
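
The in-memory buffer that replaces the migration IO channel can be sketched
as a small growable byte array. This is a minimal sketch rather than the
patch's implementation: struct mig_buf and the mig_buf_*() helpers are
hypothetical names, and unlike the code above the read helper clamps the
request to the bytes that remain after 'pos' and leaves the recorded size
unchanged. Overflow of pos + len is not checked here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct mig_buf {
    uint8_t *data;
    size_t size;     /* number of valid bytes in data */
};

/* Copy 'len' bytes to offset 'pos', growing the buffer if needed. */
static long mig_buf_write(struct mig_buf *b, size_t pos,
                          const void *src, size_t len)
{
    size_t end = pos + len;

    assert(b != NULL);
    if (len == 0) {
        return 0;
    }
    if (end > b->size) {
        uint8_t *p = realloc(b->data, end);
        if (!p) {
            return -1;
        }
        if (pos > b->size) {
            /* zero the gap between the old end and the write offset */
            memset(p + b->size, 0, pos - b->size);
        }
        b->data = p;
        b->size = end;
    }
    memcpy(b->data + pos, src, len);
    return (long)len;
}

/* Copy up to 'len' bytes starting at 'pos'; returns the bytes copied. */
static long mig_buf_read(const struct mig_buf *b, size_t pos,
                         void *dst, size_t len)
{
    assert(b != NULL);
    if (pos >= b->size) {
        return 0;
    }
    if (len > b->size - pos) {
        len = b->size - pos;         /* clamp to what is left */
    }
    memcpy(dst, b->data + pos, len);
    return (long)len;
}

/* Release the storage and leave the buffer safe to reuse. */
static void mig_buf_reset(struct mig_buf *b)
{
    assert(b != NULL);
    free(b->data);
    b->data = NULL;
    b->size = 0;
}
```

Resetting the pointer to NULL after freeing keeps a later write from
reallocating freed memory.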


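The load loop in qemu_remote_loadvm() reads one section-type byte at a time
until it sees the EOF marker. The control flow can be sketched as below.
The framing is an assumption made up for this illustration (each full
section is a type byte, a one-byte length and a payload); it is not the
real VMState stream format, and load_stream() and the SEC_* values are
hypothetical. The point shown is that an error inside the switch has to
jump out of the loop, because a bare break only leaves the switch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Section markers for this sketch only; values chosen for illustration. */
enum { SEC_EOF = 0x00, SEC_FULL = 0x04 };

/*
 * Walk a stream of [type][len][payload] records until SEC_EOF.
 * Returns 0 on a clean EOF and -1 on a truncated or unknown record.
 * '*sections' receives the number of full sections consumed.
 */
static int load_stream(const uint8_t *buf, size_t n, unsigned *sections)
{
    size_t pos = 0;
    int ret = 0;

    assert(buf != NULL && sections != NULL);
    *sections = 0;

    while (1) {
        if (pos >= n) {
            ret = -1;                /* ran out of data before EOF */
            break;
        }

        switch (buf[pos++]) {
        case SEC_FULL:
            if (pos >= n || buf[pos] > n - pos - 1) {
                ret = -1;
                goto out;            /* leave the loop, not just the switch */
            }
            pos += 1 + (size_t)buf[pos];
            (*sections)++;
            break;
        case SEC_EOF:
            goto out;
        default:
            ret = -1;
            goto out;
        }
    }

out:
    return ret;
}
```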


* [PATCH RFC server 11/11] vfio-user: acceptance test
  2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (9 preceding siblings ...)
  2021-07-19 20:00   ` [PATCH RFC server 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2021-07-19 20:00   ` Jagannathan Raman
  2021-07-20 16:12     ` Thanos Makatos
  10 siblings, 1 reply; 55+ messages in thread
From: Jagannathan Raman @ 2021-07-19 20:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Acceptance test for libvfio-user in QEMU

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 MAINTAINERS                   |  1 +
 tests/acceptance/vfio-user.py | 94 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)
 create mode 100644 tests/acceptance/vfio-user.py

diff --git a/MAINTAINERS b/MAINTAINERS
index 46ab6b6..644bd35 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3381,6 +3381,7 @@ F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: hw/remote/vfio-user-obj.c
+F: tests/acceptance/vfio-user.py
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/tests/acceptance/vfio-user.py b/tests/acceptance/vfio-user.py
new file mode 100644
index 0000000..ef318d9
--- /dev/null
+++ b/tests/acceptance/vfio-user.py
@@ -0,0 +1,94 @@
+# vfio-user protocol sanity test
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+
+import os
+import socket
+import uuid
+
+from avocado_qemu import Test
+from avocado_qemu import wait_for_console_pattern
+from avocado_qemu import exec_command
+from avocado_qemu import exec_command_and_wait_for_pattern
+
+class VfioUser(Test):
+    """
+    :avocado: tags=vfiouser
+    """
+    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
+
+    def do_test(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        """Main test method"""
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+
+        socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(socket):
+            os.remove(socket)
+
+        # Create remote process
+        remote_vm = self.get_vm()
+        remote_vm.add_args('-machine', 'x-remote')
+        remote_vm.add_args('-nodefaults')
+        remote_vm.add_args('-device', 'lsi53c895a,id=lsi1')
+        remote_vm.add_args('-object', 'vfio-user,id=vfioobj1,'
+                           'devid=lsi1,socket='+socket)
+        remote_vm.launch()
+
+        # Create proxy process
+        self.vm.set_console()
+        self.vm.add_args('-machine', machine_type)
+        self.vm.add_args('-accel', 'kvm')
+        self.vm.add_args('-cpu', 'host')
+        self.vm.add_args('-object',
+                         'memory-backend-memfd,id=sysmem-file,size=2G')
+        self.vm.add_args('--numa', 'node,memdev=sysmem-file')
+        self.vm.add_args('-m', '2048')
+        self.vm.add_args('-kernel', kernel_path,
+                         '-initrd', initrd_path,
+                         '-append', kernel_command_line)
+        self.vm.add_args('-device',
+                         'vfio-user-pci,'
+                         'socket='+socket)
+        self.vm.launch()
+        wait_for_console_pattern(self, 'as init process',
+                                 'Kernel panic - not syncing')
+        exec_command(self, 'mount -t sysfs sysfs /sys')
+        exec_command_and_wait_for_pattern(self,
+                                          'cat /sys/bus/pci/devices/*/uevent',
+                                          'PCI_ID=1000:0012')
+
+    def test_multiprocess_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test(kernel_url, initrd_url, kernel_command_line, machine_type)
+
+    def test_multiprocess_aarch64(self):
+        """
+        :avocado: tags=arch:aarch64
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'rdinit=/bin/bash console=ttyAMA0')
+        machine_type = 'virt,gic-version=3'
+        self.do_test(kernel_url, initrd_url, kernel_command_line, machine_type)
-- 
1.8.3.1




* Re: [PATCH RFC server 01/11] vfio-user: build library
  2021-07-19 20:00   ` [PATCH RFC server 01/11] vfio-user: build library Jagannathan Raman
@ 2021-07-19 20:24     ` John Levon
  2021-07-20 12:06       ` Jag Raman
  0 siblings, 1 reply; 55+ messages in thread
From: John Levon @ 2021-07-19 20:24 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos

On Mon, Jul 19, 2021 at 04:00:03PM -0400, Jagannathan Raman wrote:

> add the libvfio-user library as a submodule. build it as part of QEMU
> 
> diff --git a/meson.build b/meson.build
> index 6e4d2d8..f2f9f86 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1894,6 +1894,41 @@ if get_option('cfi') and slirp_opt == 'system'
>           + ' Please configure with --enable-slirp=git')
>  endif
>  
> +vfiouser = not_found
> +if have_system and multiprocess_allowed
> +  have_internal = fs.exists(meson.current_source_dir() / 'libvfio-user/Makefile')
> +
> +  if not have_internal
> +    error('libvfio-user source not found - please pull git submodule')
> +  endif
> +
> +  vfiouser_files = [
> +    'libvfio-user/lib/dma.c',
> +    'libvfio-user/lib/irq.c',
> +    'libvfio-user/lib/libvfio-user.c',
> +    'libvfio-user/lib/migration.c',
> +    'libvfio-user/lib/pci.c',
> +    'libvfio-user/lib/pci_caps.c',
> +    'libvfio-user/lib/tran_sock.c',
> +  ]
> +
> +  vfiouser_inc = include_directories('libvfio-user/include', 'libvfio-user/lib')
> +
> +  json_c = dependency('json-c', required: false)
> +  if not json_c.found()
> +    json_c = dependency('libjson-c')
> +  endif
> +
> +  libvfiouser = static_library('vfiouser',
> +                               build_by_default: false,
> +                               sources: vfiouser_files,
> +                               dependencies: json_c,
> +                               include_directories: vfiouser_inc)
> +
> +  vfiouser = declare_dependency(link_with: libvfiouser,
> +                                include_directories: vfiouser_inc)
> +endif

Why this way, rather than recursing into the submodule? Seems a bit fragile to
encode details of the library here.

regards
john


* Re: [PATCH RFC 12/19] vfio-user: probe remote device's BARs
  2021-07-19  6:27 ` [PATCH RFC 12/19] vfio-user: probe remote device's BARs Elena Ufimtseva
@ 2021-07-19 22:59   ` Alex Williamson
  2021-07-20  1:39     ` John Johnson
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2021-07-19 22:59 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	stefanha

On Sun, 18 Jul 2021 23:27:51 -0700
Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> @@ -3442,6 +3448,22 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>      /* QEMU can also add or extend BARs */
>      memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
>  
> +    /*
> +     * Local QEMU overrides aren't allowed
> +     * They must be done in the device process
> +     */
> +    if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        error_setg(errp, "Multi-function must be specified by device process");
> +        goto error;
> +    }
> +    if (pdev->romfile) {
> +        error_setg(errp, "Romfile must be specified by device process");
> +        goto error;
> +    }

For what reason?  PCI functions can operate completely independently,
there could be different servers for different functions, a QEMU user
may wish to apply a different option ROM image than provided by the
server.  This all creates unnecessary incompatibilities.  Thanks,

Alex




* Re: [PATCH RFC 12/19] vfio-user: probe remote device's BARs
  2021-07-19 22:59   ` Alex Williamson
@ 2021-07-20  1:39     ` John Johnson
  2021-07-20  3:01       ` Alex Williamson
  0 siblings, 1 reply; 55+ messages in thread
From: John Johnson @ 2021-07-20  1:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, stefanha



> On Jul 19, 2021, at 3:59 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Sun, 18 Jul 2021 23:27:51 -0700
> Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>> @@ -3442,6 +3448,22 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>>     /* QEMU can also add or extend BARs */
>>     memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
>> 
>> +    /*
>> +     * Local QEMU overrides aren't allowed
>> +     * They must be done in the device process
>> +     */
>> +    if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
>> +        error_setg(errp, "Multi-function must be specified by device process");
>> +        goto error;
>> +    }
>> +    if (pdev->romfile) {
>> +        error_setg(errp, "Romfile must be specified by device process");
>> +        goto error;
>> +    }
> 
> For what reason?  PCI functions can operate completely independently,
> there could be different servers for different functions, a QEMU user
> may wish to apply a different option ROM image than provided by the
> server.  This all creates unnecessary incompatibilities.  Thanks,
> 

	The idea is to have all the device options specified on the remote
server, and have the vfio client just be a pass-through device.  I thought
having them specified in 2 places would cause more confusion.

								JJ




* Re: [PATCH RFC 12/19] vfio-user: probe remote device's BARs
  2021-07-20  1:39     ` John Johnson
@ 2021-07-20  3:01       ` Alex Williamson
  0 siblings, 0 replies; 55+ messages in thread
From: Alex Williamson @ 2021-07-20  3:01 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, stefanha

On Tue, 20 Jul 2021 01:39:21 +0000
John Johnson <john.g.johnson@oracle.com> wrote:

> > On Jul 19, 2021, at 3:59 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > On Sun, 18 Jul 2021 23:27:51 -0700
> > Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:  
> >> @@ -3442,6 +3448,22 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
> >>     /* QEMU can also add or extend BARs */
> >>     memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
> >> 
> >> +    /*
> >> +     * Local QEMU overrides aren't allowed
> >> +     * They must be done in the device process
> >> +     */
> >> +    if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> >> +        error_setg(errp, "Multi-function must be specified by device process");
> >> +        goto error;
> >> +    }
> >> +    if (pdev->romfile) {
> >> +        error_setg(errp, "Romfile must be specified by device process");
> >> +        goto error;
> >> +    }  
> > 
> > For what reason?  PCI functions can operate completely independently,
> > there could be different servers for different functions, a QEMU user
> > may wish to apply a different option ROM image than provided by the
> > server.  This all creates unnecessary incompatibilities.  Thanks,
> >   
> 
> 	The idea is to have all the device options specified on the remote
> server, and have the vfio client just be a pass-through device.  I thought
> having them specified in 2 places would cause more confusion.

IMO, the server has no business making such configuration restrictions.
It's the client's decision if it wants to create multi-function
topologies or override the option rom.  Same for whether it wants to
override or virtualize capabilities.  All of this should just work
as-is; it's actually additional code required and knowledge through the
management stack to prevent it.  Thanks,

Alex
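
For illustration only, here is a minimal sketch of the policy argued for above: a ROM image chosen locally by the QEMU user takes precedence, and the image provided by the server is used otherwise. The struct and function names are hypothetical and are not taken from the patch or from QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the two possible sources of an option ROM. */
struct rom_choice {
    const char *local_romfile;   /* set by the QEMU user, may be NULL */
    const char *server_romfile;  /* offered by the device process, may be NULL */
};

/* Pick the ROM image the client would expose: a local override wins. */
static const char *effective_romfile(const struct rom_choice *c)
{
    assert(c != NULL);
    return c->local_romfile ? c->local_romfile : c->server_romfile;
}
```

With this shape no extra check is needed in the realize path; the absence of a local override simply falls through to whatever the server supplies.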



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC server 01/11] vfio-user: build library
  2021-07-19 20:24     ` John Levon
@ 2021-07-20 12:06       ` Jag Raman
  2021-07-20 12:20         ` Marc-André Lureau
  0 siblings, 1 reply; 55+ messages in thread
From: Jag Raman @ 2021-07-20 12:06 UTC (permalink / raw)
  To: John Levon
  Cc: Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Swapnil Ingle, richard.henderson, qemu-devel, f4bug,
	Marc-André Lureau, alex.williamson, stefanha, Paolo Bonzini,
	Thanos Makatos



> On Jul 19, 2021, at 4:24 PM, John Levon <john.levon@nutanix.com> wrote:
> 
> On Mon, Jul 19, 2021 at 04:00:03PM -0400, Jagannathan Raman wrote:
> 
>> add the libvfio-user library as a submodule. build it as part of QEMU
>> 
>> diff --git a/meson.build b/meson.build
>> index 6e4d2d8..f2f9f86 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -1894,6 +1894,41 @@ if get_option('cfi') and slirp_opt == 'system'
>>          + ' Please configure with --enable-slirp=git')
>> endif
>> 
>> +vfiouser = not_found
>> +if have_system and multiprocess_allowed
>> +  have_internal = fs.exists(meson.current_source_dir() / 'libvfio-user/Makefile')
>> +
>> +  if not have_internal
>> +    error('libvfio-user source not found - please pull git submodule')
>> +  endif
>> +
>> +  vfiouser_files = [
>> +    'libvfio-user/lib/dma.c',
>> +    'libvfio-user/lib/irq.c',
>> +    'libvfio-user/lib/libvfio-user.c',
>> +    'libvfio-user/lib/migration.c',
>> +    'libvfio-user/lib/pci.c',
>> +    'libvfio-user/lib/pci_caps.c',
>> +    'libvfio-user/lib/tran_sock.c',
>> +  ]
>> +
>> +  vfiouser_inc = include_directories('libvfio-user/include', 'libvfio-user/lib')
>> +
>> +  json_c = dependency('json-c', required: false)
>> +  if not json_c.found()
>> +    json_c = dependency('libjson-c')
>> +  endif
>> +
>> +  libvfiouser = static_library('vfiouser',
>> +                               build_by_default: false,
>> +                               sources: vfiouser_files,
>> +                               dependencies: json_c,
>> +                               include_directories: vfiouser_inc)
>> +
>> +  vfiouser = declare_dependency(link_with: libvfiouser,
>> +                                include_directories: vfiouser_inc)
>> +endif
> 
> Why this way, rather than recursing into the submodule? Seems a bit fragile to
> encode details of the library here.

+maintainers of meson.build. I apologize for not adding them when I sent the
patches out initially. I copied the email list from Elena, but Elena did not make
any changes to meson.build - stupid me.

John, 

    This appears to be the present convention in QEMU; I’m also not very clear
on the reason for it.

For example submodules such as slirp (libslirp), capstone (libcapstone),
dtc (libfdt) are built this way.

I’m guessing it’s because QEMU doesn’t build all parts of a submodule. For
example, QEMU only builds libfdt in the dtc submodule. Similarly, for
libvfio-user QEMU builds only the core library, without the tests and samples.

> 
> regards
> john


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC server 01/11] vfio-user: build library
  2021-07-20 12:06       ` Jag Raman
@ 2021-07-20 12:20         ` Marc-André Lureau
  2021-07-20 13:09           ` John Levon
  0 siblings, 1 reply; 55+ messages in thread
From: Marc-André Lureau @ 2021-07-20 12:20 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Swapnil Ingle, John Levon, richard.henderson, qemu-devel, f4bug,
	alex.williamson, stefanha, Thanos Makatos, Paolo Bonzini


Hi

On Tue, Jul 20, 2021 at 4:12 PM Jag Raman <jag.raman@oracle.com> wrote:

>
>
> > On Jul 19, 2021, at 4:24 PM, John Levon <john.levon@nutanix.com> wrote:
> >
> > On Mon, Jul 19, 2021 at 04:00:03PM -0400, Jagannathan Raman wrote:
> >
> >> add the libvfio-user library as a submodule. build it as part of QEMU
> >>
> >> diff --git a/meson.build b/meson.build
> >> index 6e4d2d8..f2f9f86 100644
> >> --- a/meson.build
> >> +++ b/meson.build
> >> @@ -1894,6 +1894,41 @@ if get_option('cfi') and slirp_opt == 'system'
> >>          + ' Please configure with --enable-slirp=git')
> >> endif
> >>
> >> +vfiouser = not_found
> >> +if have_system and multiprocess_allowed
> >> +  have_internal = fs.exists(meson.current_source_dir() /
> 'libvfio-user/Makefile')
> >> +
> >> +  if not have_internal
> >> +    error('libvfio-user source not found - please pull git submodule')
> >> +  endif
> >> +
> >> +  vfiouser_files = [
> >> +    'libvfio-user/lib/dma.c',
> >> +    'libvfio-user/lib/irq.c',
> >> +    'libvfio-user/lib/libvfio-user.c',
> >> +    'libvfio-user/lib/migration.c',
> >> +    'libvfio-user/lib/pci.c',
> >> +    'libvfio-user/lib/pci_caps.c',
> >> +    'libvfio-user/lib/tran_sock.c',
> >> +  ]
> >> +
> >> +  vfiouser_inc = include_directories('libvfio-user/include',
> 'libvfio-user/lib')
> >> +
> >> +  json_c = dependency('json-c', required: false)
> >> +  if not json_c.found()
> >> +    json_c = dependency('libjson-c')
> >> +  endif
> >> +
> >> +  libvfiouser = static_library('vfiouser',
> >> +                               build_by_default: false,
> >> +                               sources: vfiouser_files,
> >> +                               dependencies: json_c,
> >> +                               include_directories: vfiouser_inc)
> >> +
> >> +  vfiouser = declare_dependency(link_with: libvfiouser,
> >> +                                include_directories: vfiouser_inc)
> >> +endif
> >
> > Why this way, rather than recursing into the submodule? Seems a bit
> fragile to
> > encode details of the library here.
>
> +maintainers of meson.build. I apologize for not adding them when I sent
> the
> patches out initially. I copied the email list from Elena, but Elena did
> not make
> any changes to meson.build - stupid me.
>
> John,
>
>     This way appears to be present convention with QEMU - I’m also not
> very clear
> on the reason for it.
>
> For example submodules such as slirp (libslirp), capstone (libcapstone),
> dtc (libfdt) are built this way.
>

For slirp and dtc, we are eventually going to use meson subproject(). No
idea about capstone.

>
> I’m guessing it’s because QEMU doesn’t build all parts of a submodule. For
> example, QEMU only builds libfdt in the dtc submodule. Similarly,
> libvfio-user only builds the core library without building the tests and
> samples.
>
>
You can give subproject options to build limited parts.

Fwiw, since libvfio-user uses cmake, we may be able to use meson
cmake.subproject() (https://mesonbuild.com/CMake-module.html).

-- 
Marc-André Lureau


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC server 01/11] vfio-user: build library
  2021-07-20 12:20         ` Marc-André Lureau
@ 2021-07-20 13:09           ` John Levon
  0 siblings, 0 replies; 55+ messages in thread
From: John Levon @ 2021-07-20 13:09 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Elena Ufimtseva, John Johnson, Jag Raman, Swapnil Ingle,
	richard.henderson, qemu-devel, f4bug, alex.williamson,
	Paolo Bonzini, stefanha, Thanos Makatos, Daniel P. Berrangé

On Tue, Jul 20, 2021 at 04:20:13PM +0400, Marc-André Lureau wrote:

> > >> +  libvfiouser = static_library('vfiouser',
> > >> +                               build_by_default: false,
> > >> +                               sources: vfiouser_files,
> > >> +                               dependencies: json_c,
> > >> +                               include_directories: vfiouser_inc)
> >
> >     This way appears to be present convention with QEMU - I’m also not
> > very clear
> > on the reason for it.
> >
> > I’m guessing it’s because QEMU doesn’t build all parts of a submodule. For
> > example, QEMU only builds libfdt in the dtc submodule. Similarly,
> > libvfio-user only builds the core library without building the tests and
> > samples.
> >
> You can give subproject options to build limited parts.
> 
> Fwiw, since libvfio-user uses cmake, we may be able to use meson
> cmake.subproject() (https://mesonbuild.com/CMake-module.html).

That'd be great. We also briefly discussed moving away from cmake anyway - since
both SPDK and qemu are meson-based, it seems like it would make sense. I'd
prefer it to be easy to regularly update libvfio-user within these projects.

Ideally, running qemu tests would actually run libvfio-user tests too, for some
level of assurance on the library's internal expectations.

regards
john


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH RFC server 10/11] vfio-user: register handlers to facilitate migration
  2021-07-19 20:00   ` [PATCH RFC server 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2021-07-20 14:05     ` Thanos Makatos
  0 siblings, 0 replies; 55+ messages in thread
From: Thanos Makatos @ 2021-07-20 14:05 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, Swapnil Ingle, John Levon,
	alex.williamson, stefanha

> -----Original Message-----
> From: Jagannathan Raman <jag.raman@oracle.com>
> Sent: 19 July 2021 21:00
> To: qemu-devel@nongnu.org
> Cc: stefanha@redhat.com; alex.williamson@redhat.com;
> elena.ufimtseva@oracle.com; John Levon <john.levon@nutanix.com>;
> john.g.johnson@oracle.com; Thanos Makatos
> <thanos.makatos@nutanix.com>; Swapnil Ingle
> <swapnil.ingle@nutanix.com>; jag.raman@oracle.com
> Subject: [PATCH RFC server 10/11] vfio-user: register handlers to facilitate
> migration
> 
> Store and load the device's state using handlers for live migration
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  migration/savevm.h        |   2 +
>  hw/remote/vfio-user-obj.c | 287
> ++++++++++++++++++++++++++++++++++++++++++++++
>  migration/savevm.c        |  63 ++++++++++
>  3 files changed, 352 insertions(+)
> 
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 6461342..71d1733 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -67,5 +67,7 @@ int qemu_loadvm_state_main(QEMUFile *f,
> MigrationIncomingState *mis);
>  int qemu_load_device_state(QEMUFile *f);
>  int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>          bool in_postcopy, bool inactivate_disks);
> +int qemu_remote_savevm(QEMUFile *f);
> +int qemu_remote_loadvm(QEMUFile *f);
> 
>  #endif
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index d2a2e51..5948576 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -44,6 +44,10 @@
>  #include "hw/boards.h"
>  #include "hw/remote/iohub.h"
>  #include "hw/remote/machine.h"
> +#include "migration/qemu-file.h"
> +#include "migration/savevm.h"
> +#include "migration/global_state.h"
> +#include "block/block.h"
> 
>  #include "libvfio-user/include/libvfio-user.h"
> 
> @@ -73,6 +77,31 @@ struct VfuObject {
>      PCIDevice *pci_dev;
> 
>      QemuThread vfu_ctx_thread;
> +
> +    /*
> +     * vfu_mig_buf holds the migration data. In the remote process, this
> +     * buffer replaces the role of an IO channel which links the source
> +     * and the destination.
> +     *
> +     * Whenever the client QEMU process initiates migration, the libvfio-user
> +     * library notifies that to this server. The remote/server QEMU sets up a
> +     * QEMUFile object using this buffer as backend. The remote passes this

Can we use remote/server more consistently? E.g. "remote process" or "server" instead of just "remote"? (makes me think of git remotes :D)

> +     * object to its migration subsystem, and it slirps the VMSDs of all its

By "slirps" do you mean transfer from the client to the server over the SLiRP network?

> +     * devices and stores them in this buffer.

Isn't this a per-device object? If so, then why do we store the VMSDs of *all* the devices in a single device's buffer? I think I'm missing something here.

> +     *
> +     * libvfio-user library subsequetly asks the remote for any data that needs
> +     * to be moved over to the destination using its vfu_migration_callbacks_t

It's not obvious to me: is this the libvfio-user library running at the server?

> +     * APIs. The remote hands over this buffer as data at this time.

Hands over the buffer to whom?

> +     *
> +     * A reverse of this process happens at the destination.
> +     */
> +    uint8_t *vfu_mig_buf;

Does the above description refer to a typical use case of the VFIO migration protocol where data is copied in an iterative manner (implemented in libvfio-user by the migration callbacks)? Is this what you're documenting here?

> +
> +    uint64_t vfu_mig_buf_size;
> +
> +    uint64_t vfu_mig_buf_pending;
> +
> +    QEMUFile *vfu_mig_file;
>  };
> 
>  static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
> @@ -97,6 +126,226 @@ static void vfu_object_set_devid(Object *obj, const
> char *str, Error **errp)
>      trace_vfu_prop("devid", str);
>  }
> 
> +/**
> + * Migration helper functions
> + *
> + * vfu_mig_buf_read & vfu_mig_buf_write are used by QEMU's migration
> + * subsystem - qemu_remote_savevm & qemu_remote_loadvm.

vfu_mig_buf_read is used by qemu_remote_loadvm and vfu_mig_buf_write is used by qemu_remote_savevm, right? The order they're written suggests the opposite.

> savevm/loadvm
> + * call these functions via QEMUFileOps to save/load the VMSD of all
> + * the devices into vfu_mig_buf
> + *
> + */
> +static ssize_t vfu_mig_buf_read(void *opaque, uint8_t *buf, int64_t pos,
> +                                size_t size, Error **errp)
> +{
> +    VfuObject *o = opaque;
> +
> +    if (pos > o->vfu_mig_buf_size) {
> +        size = 0;
> +    } else if ((pos + size) > o->vfu_mig_buf_size) {
> +        size = o->vfu_mig_buf_size;
> +    }
> +
> +    memcpy(buf, (o->vfu_mig_buf + pos), size);
> +
> +    o->vfu_mig_buf_size -= size;
> +
> +    return size;
> +}
> +
> +static ssize_t vfu_mig_buf_write(void *opaque, struct iovec *iov, int iovcnt,
> +                                 int64_t pos, Error **errp)
> +{
> +    VfuObject *o = opaque;
> +    uint64_t end = pos + iov_size(iov, iovcnt);
> +    int i;
> +
> +    if (end > o->vfu_mig_buf_size) {
> +        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
> +    }
> +
> +    for (i = 0; i < iovcnt; i++) {
> +        memcpy((o->vfu_mig_buf + o->vfu_mig_buf_size), iov[i].iov_base,
> +               iov[i].iov_len);
> +        o->vfu_mig_buf_size += iov[i].iov_len;
> +        o->vfu_mig_buf_pending += iov[i].iov_len;
> +    }
> +
> +    return iov_size(iov, iovcnt);
> +}
> +
> +static int vfu_mig_buf_shutdown(void *opaque, bool rd, bool wr, Error
> **errp)
> +{
> +    VfuObject *o = opaque;
> +
> +    o->vfu_mig_buf_size = 0;
> +
> +    g_free(o->vfu_mig_buf);
> +
> +    return 0;
> +}
> +
> +static const QEMUFileOps vfu_mig_fops_save = {
> +    .writev_buffer  = vfu_mig_buf_write,
> +    .shut_down      = vfu_mig_buf_shutdown,
> +};
> +
> +static const QEMUFileOps vfu_mig_fops_load = {
> +    .get_buffer     = vfu_mig_buf_read,
> +    .shut_down      = vfu_mig_buf_shutdown,
> +};
> +
> +/**
> + * handlers for vfu_migration_callbacks_t
> + *
> + * The libvfio-user library accesses these handlers to drive the migration
> + * at the remote end, and also to transport the data stored in vfu_mig_buf
> + *
> + */
> +static void vfu_mig_state_precopy(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    int ret;
> +
> +    if (!o->vfu_mig_file) {
> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_save);
> +    }
> +
> +    global_state_store();
> +
> +    ret = qemu_remote_savevm(o->vfu_mig_file);
> +    if (ret) {
> +        qemu_file_shutdown(o->vfu_mig_file);
> +        return;
> +    }
> +
> +    qemu_fflush(o->vfu_mig_file);
> +
> +    bdrv_inactivate_all();
> +}
> +
> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> +    if (ret) {
> +        error_setg(&error_abort, "vfu: failed to restore device state");
> +        return;
> +    }
> +
> +    bdrv_invalidate_cache_all(&local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return;
> +    }
> +
> +    vm_start();
> +}
> +
> +static int vfu_mig_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
> +{
> +    switch (state) {
> +    case VFU_MIGR_STATE_RESUME:
> +    case VFU_MIGR_STATE_STOP_AND_COPY:
> +    case VFU_MIGR_STATE_STOP:
> +        break;

Can you explain why we don't have to do anything in the above cases?

> +    case VFU_MIGR_STATE_PRE_COPY:
> +        vfu_mig_state_precopy(vfu_ctx);
> +        break;
> +    case VFU_MIGR_STATE_RUNNING:
> +        if (!runstate_is_running()) {
> +            vfu_mig_state_running(vfu_ctx);
> +        }
> +        break;
> +    default:
> +        warn_report("vfu: Unknown migration state %d", state);
> +    }
> +
> +    return 0;
> +}
> +
> +static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    return o->vfu_mig_buf_pending;
> +}
> +
> +static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
> +                                uint64_t *size)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    if (offset) {
> +        *offset = 0;
> +    }
> +
> +    if (size) {
> +        *size = o->vfu_mig_buf_size;
> +    }
> +
> +    return 0;
> +}
> +
> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> +                                 uint64_t size, uint64_t offset)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    if (offset > o->vfu_mig_buf_size) {
> +        return -1;
> +    }
> +
> +    if ((offset + size) > o->vfu_mig_buf_size) {
> +        warn_report("vfu: buffer overflow - check pending_bytes");
> +        size = o->vfu_mig_buf_size - offset;
> +    }
> +
> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> +
> +    o->vfu_mig_buf_pending -= size;
> +
> +    return size;
> +}
> +
> +static ssize_t vfu_mig_write_data(vfu_ctx_t *vfu_ctx, void *data,
> +                                  uint64_t size, uint64_t offset)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    uint64_t end = offset + size;
> +
> +    if (end > o->vfu_mig_buf_size) {
> +        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
> +        o->vfu_mig_buf_size = end;
> +    }
> +
> +    memcpy((o->vfu_mig_buf + offset), data, size);
> +
> +    if (!o->vfu_mig_file) {
> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load);
> +    }
> +
> +    return size;
> +}
> +
> +static int vfu_mig_data_written(vfu_ctx_t *vfu_ctx, uint64_t count)
> +{
> +    return 0;
> +}
> +
> +static const vfu_migration_callbacks_t vfu_mig_cbs = {
> +    .version = VFU_MIGR_CALLBACKS_VERS,
> +    .transition = &vfu_mig_transition,
> +    .get_pending_bytes = &vfu_mig_get_pending_bytes,
> +    .prepare_data = &vfu_mig_prepare_data,
> +    .read_data = &vfu_mig_read_data,
> +    .data_written = &vfu_mig_data_written,
> +    .write_data = &vfu_mig_write_data,
> +};
> +
>  static void *vfu_object_ctx_run(void *opaque)
>  {
>      VfuObject *o = opaque;
> @@ -332,6 +581,7 @@ static void vfu_object_machine_done(Notifier
> *notifier, void *data)
>  {
>      VfuObject *o = container_of(notifier, VfuObject, machine_done);
>      DeviceState *dev = NULL;
> +    size_t migr_area_size;
>      int ret;
> 
>      o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
> @@ -391,6 +641,35 @@ static void vfu_object_machine_done(Notifier
> *notifier, void *data)
>          return;
>      }
> 
> +    /*
> +     * TODO: The 0x20000 number used below is a temporary. We are
> working on
> +     *     a cleaner fix for this.
> +     *
> +     *     The libvfio-user library assumes that the remote knows the size of
> +     *     the data to be migrated at boot time, but that is not the case with
> +     *     VMSDs, as it can contain a variable-size buffer. 0x20000 is used
> +     *     as a sufficiently large buffer to demonstrate migration, but that
> +     *     cannot be used as a solution.
> +     *
> +     */

The size of the migration region dictates the amount of migration data that can be produced/consumed in one go; it is not necessarily the total size of the migration data produced/consumed throughout the migration operation.
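
To make that distinction concrete, here is a small self-contained sketch (plain C, not using libvfio-user) of draining a state blob larger than the region through a fixed-size window, one chunk per iteration. All names (mig_src, mig_pending, mig_stage_chunk, mig_drain) are hypothetical; mig_pending only loosely mirrors the role of a get_pending_bytes callback.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the device's migration state. */
struct mig_src {
    const uint8_t *data;   /* all state that has to be migrated */
    uint64_t total;        /* total size of that state in bytes */
    uint64_t sent;         /* bytes already handed over */
};

/* Bytes that still have to be transferred. */
static uint64_t mig_pending(const struct mig_src *s)
{
    return s->total - s->sent;
}

/* Stage the next chunk in the window; at most region_size bytes per call. */
static uint64_t mig_stage_chunk(struct mig_src *s, uint8_t *window,
                                uint64_t region_size)
{
    uint64_t n = mig_pending(s);

    if (n > region_size) {
        n = region_size;
    }
    memcpy(window, s->data + s->sent, n);
    s->sent += n;
    return n;
}

/* Iterate until nothing is pending; returns the number of iterations. */
static unsigned mig_drain(struct mig_src *s, uint8_t *window,
                          uint64_t region_size, uint8_t *dst)
{
    unsigned iters = 0;
    uint64_t off = 0;

    assert(region_size > 0);
    while (mig_pending(s) > 0) {
        uint64_t n = mig_stage_chunk(s, window, region_size);

        memcpy(dst + off, window, n);   /* consumer copies the chunk out */
        off += n;
        iters++;
    }
    return iters;
}
```

In this model a 10-byte state and a 4-byte window take three iterations, so the window size bounds the work per step rather than the total amount of data.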

> +    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX,
> +                           0x20000, NULL,
> +                           VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
> +    if (ret < 0) {
> +        error_setg(&error_abort, "vfu: Failed to register migration BAR %s- %s",
> +                   o->devid, strerror(errno));
> +        return;
> +    }
> +
> +    migr_area_size = vfu_get_migr_register_area_size();
> +    ret = vfu_setup_device_migration_callbacks(o->vfu_ctx, &vfu_mig_cbs,
> +                                               migr_area_size);
> +    if (ret < 0) {
> +        error_setg(&error_abort, "vfu: Failed to setup migration %s- %s",
> +                   o->devid, strerror(errno));
> +        return;
> +    }
> +
>      qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner",
> vfu_object_ctx_run,
>                         o, QEMU_THREAD_JOINABLE);
>  }
> @@ -412,6 +691,14 @@ static void vfu_object_init(Object *obj)
> 
>      o->machine_done.notify = vfu_object_machine_done;
>      qemu_add_machine_init_done_notifier(&o->machine_done);
> +
> +    o->vfu_mig_file = NULL;
> +
> +    o->vfu_mig_buf = NULL;
> +
> +    o->vfu_mig_buf_size = 0;
> +
> +    o->vfu_mig_buf_pending = 0;
>  }
> 
>  static void vfu_object_finalize(Object *obj)
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 72848b9..c2279af 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1603,6 +1603,33 @@ static int qemu_savevm_state(QEMUFile *f, Error
> **errp)
>      return ret;
>  }
> 
> +int qemu_remote_savevm(QEMUFile *f)
> +{
> +    SaveStateEntry *se;
> +    int ret;
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->vmsd || !vmstate_save_needed(se->vmsd, se->opaque)) {
> +            continue;
> +        }
> +
> +        save_section_header(f, se, QEMU_VM_SECTION_FULL);
> +
> +        ret = vmstate_save(f, se, NULL);
> +        if (ret) {
> +            qemu_file_set_error(f, ret);
> +            return ret;
> +        }
> +
> +        save_section_footer(f, se);
> +    }
> +
> +    qemu_put_byte(f, QEMU_VM_EOF);
> +    qemu_fflush(f);
> +
> +    return 0;
> +}
> +
>  void qemu_savevm_live_state(QEMUFile *f)
>  {
>      /* save QEMU_VM_SECTION_END section */
> @@ -2443,6 +2470,42 @@ qemu_loadvm_section_start_full(QEMUFile *f,
> MigrationIncomingState *mis)
>      return 0;
>  }
> 
> +int qemu_remote_loadvm(QEMUFile *f)
> +{
> +    uint8_t section_type;
> +    int ret = 0;
> +
> +    qemu_mutex_lock_iothread();
> +
> +    while (true) {
> +        section_type = qemu_get_byte(f);
> +
> +        if (qemu_file_get_error(f)) {
> +            ret = qemu_file_get_error(f);
> +            break;
> +        }
> +
> +        switch (section_type) {
> +        case QEMU_VM_SECTION_FULL:
> +            ret = qemu_loadvm_section_start_full(f, NULL);
> +            if (ret < 0) {
> +                break;
> +            }
> +            break;
> +        case QEMU_VM_EOF:
> +            goto out;
> +        default:
> +            ret = -EINVAL;
> +            goto out;
> +        }
> +    }
> +
> +out:
> +    qemu_mutex_unlock_iothread();
> +
> +    return ret;
> +}
> +
>  static int
>  qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState
> *mis)
>  {
> --
> 1.8.3.1

My background in implementing device migration with libvfio-user is for a specific device; it seems to me that you're using this functionality differently. Maybe that's why I'm getting confused. If so, could you explain in more detail how you're using libvfio-user here?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH RFC server 05/11] vfio-user: run vfio-user context
  2021-07-19 20:00   ` [PATCH RFC server 05/11] vfio-user: run vfio-user context Jagannathan Raman
@ 2021-07-20 14:17     ` Thanos Makatos
  2021-08-13 14:51       ` Jag Raman
  0 siblings, 1 reply; 55+ messages in thread
From: Thanos Makatos @ 2021-07-20 14:17 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, Swapnil Ingle, John Levon,
	alex.williamson, stefanha

> -----Original Message-----
> From: Jagannathan Raman <jag.raman@oracle.com>
> Sent: 19 July 2021 21:00
> To: qemu-devel@nongnu.org
> Cc: stefanha@redhat.com; alex.williamson@redhat.com;
> elena.ufimtseva@oracle.com; John Levon <john.levon@nutanix.com>;
> john.g.johnson@oracle.com; Thanos Makatos
> <thanos.makatos@nutanix.com>; Swapnil Ingle
> <swapnil.ingle@nutanix.com>; jag.raman@oracle.com
> Subject: [PATCH RFC server 05/11] vfio-user: run vfio-user context
> 
> Setup a separate thread to run the vfio-user context. The thread acts as
> the main loop for the device.

In your "vfio-user: instantiate vfio-user context" patch you create the vfu context in blocking mode, so the only way to run device emulation is in a separate thread.
Were you going to create a separate thread anyway? You can run device emulation in polling mode, which avoids creating a separate thread and thus saves resources. Do you plan to do that in the future?
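
As a rough sketch of what I mean by polling, assuming the context exposes a file descriptor that becomes readable when a request arrives: the existing event loop can check it with a zero timeout and service one request at a time. The code below does not call libvfio-user; service_once and the handle callback are hypothetical names, and a pipe can stand in for the real descriptor when trying it out.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Service at most one pending request on fd without blocking.
 * Returns 1 if a request was handled, 0 if nothing was ready,
 * and -1 on error or hangup. handle() stands in for running the
 * device context once.
 */
static int service_once(int fd, int (*handle)(int fd))
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    assert(handle != NULL);

    ret = poll(&pfd, 1, 0);      /* zero timeout: never blocks */
    if (ret <= 0) {
        return ret;              /* 0: idle, -1: poll() failed */
    }
    if (!(pfd.revents & POLLIN)) {
        return -1;               /* hangup or error on the descriptor */
    }
    return handle(fd) < 0 ? -1 : 1;
}

/* Example handler: consume a single byte from the descriptor. */
static int drain_one_byte(int fd)
{
    char c;

    return read(fd, &c, 1) == 1 ? 0 : -1;
}
```

Registering such a descriptor with the main loop would let requests be handled from the existing thread instead of a dedicated one.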

> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 44
> ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index e362709..6a2d0f5 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -35,6 +35,7 @@
>  #include "trace.h"
>  #include "sysemu/runstate.h"
>  #include "qemu/notify.h"
> +#include "qemu/thread.h"
>  #include "qapi/error.h"
>  #include "sysemu/sysemu.h"
>  #include "hw/qdev-core.h"
> @@ -66,6 +67,8 @@ struct VfuObject {
>      vfu_ctx_t *vfu_ctx;
> 
>      PCIDevice *pci_dev;
> +
> +    QemuThread vfu_ctx_thread;
>  };
> 
>  static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
> @@ -90,6 +93,44 @@ static void vfu_object_set_devid(Object *obj, const
> char *str, Error **errp)
>      trace_vfu_prop("devid", str);
>  }
> 
> +static void *vfu_object_ctx_run(void *opaque)
> +{
> +    VfuObject *o = opaque;
> +    int ret;
> +
> +    ret = vfu_realize_ctx(o->vfu_ctx);
> +    if (ret < 0) {
> +        error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
> +                   o->devid, strerror(errno));
> +        return NULL;
> +    }
> +
> +    ret = vfu_attach_ctx(o->vfu_ctx);
> +    if (ret < 0) {
> +        error_setg(&error_abort,
> +                   "vfu: Failed to attach device %s to context - %s",
> +                   o->devid, strerror(errno));
> +        return NULL;
> +    }
> +
> +    do {
> +        ret = vfu_run_ctx(o->vfu_ctx);
> +        if (ret < 0) {
> +            if (errno == EINTR) {
> +                ret = 0;
> +            } else if (errno == ENOTCONN) {
> +                object_unparent(OBJECT(o));
> +                break;
> +            } else {
> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
> +                           o->devid, strerror(errno));
> +            }
> +        }
> +    } while (ret == 0);
> +
> +    return NULL;
> +}
> +
>  static void vfu_object_machine_done(Notifier *notifier, void *data)
>  {
>      VfuObject *o = container_of(notifier, VfuObject, machine_done);
> @@ -125,6 +166,9 @@ static void vfu_object_machine_done(Notifier
> *notifier, void *data)
>                     pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
>                     pci_get_word(o->pci_dev->config +
> PCI_SUBSYSTEM_VENDOR_ID),
>                     pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));
> +
> +    qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner",
> vfu_object_ctx_run,
> +                       o, QEMU_THREAD_JOINABLE);
>  }
> 
>  static void vfu_object_init(Object *obj)
> --
> 1.8.3.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH RFC server 07/11] vfio-user: handle DMA mappings
  2021-07-19 20:00   ` [PATCH RFC server 07/11] vfio-user: handle DMA mappings Jagannathan Raman
@ 2021-07-20 14:38     ` Thanos Makatos
  0 siblings, 0 replies; 55+ messages in thread
From: Thanos Makatos @ 2021-07-20 14:38 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, Swapnil Ingle, John Levon,
	alex.williamson, stefanha

> -----Original Message-----
> From: Jagannathan Raman <jag.raman@oracle.com>
> Sent: 19 July 2021 21:00
> To: qemu-devel@nongnu.org
> Cc: stefanha@redhat.com; alex.williamson@redhat.com;
> elena.ufimtseva@oracle.com; John Levon <john.levon@nutanix.com>;
> john.g.johnson@oracle.com; Thanos Makatos
> <thanos.makatos@nutanix.com>; Swapnil Ingle
> <swapnil.ingle@nutanix.com>; jag.raman@oracle.com
> Subject: [PATCH RFC server 07/11] vfio-user: handle DMA mappings
> 
> Define and register callbacks to manage the RAM regions used for
> device DMA
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 58
> +++++++++++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  2 ++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 60d9fa8..d158a7f 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -161,6 +161,57 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>      return count;
>  }
> 
> +static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> +{
> +    MemoryRegion *subregion = NULL;
> +    g_autofree char *name = NULL;
> +    static unsigned int suffix;
> +    struct iovec *iov = &info->iova;
> +
> +    if (!info->vaddr) {
> +        return;
> +    }

This shouldn't happen; you can replace it with an assert if you want.

> +
> +    name = g_strdup_printf("remote-mem-%u", suffix++);
> +
> +    subregion = g_new0(MemoryRegion, 1);
> +
> +    qemu_mutex_lock_iothread();
> +
> +    memory_region_init_ram_ptr(subregion, NULL, name,
> +                               iov->iov_len, info->vaddr);
> +
> +    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
> +                                subregion);
> +
> +    qemu_mutex_unlock_iothread();
> +
> +    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
> +}
> +
> +static int dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> +{
> +    MemoryRegion *mr = NULL;
> +    ram_addr_t offset;
> +
> +    mr = memory_region_from_host(info->vaddr, &offset);
> +    if (!mr) {

Is this expected? If not then should we at least log something?

> +        return 0;
> +    }
> +
> +    qemu_mutex_lock_iothread();
> +
> +    memory_region_del_subregion(get_system_memory(), mr);
> +
> +    object_unparent((OBJECT(mr)));
> +
> +    qemu_mutex_unlock_iothread();
> +
> +    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
> +
> +    return 0;
> +}
> +
>  static void vfu_object_machine_done(Notifier *notifier, void *data)
>  {
>      VfuObject *o = container_of(notifier, VfuObject, machine_done);
> @@ -208,6 +259,13 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>          return;
>      }
> 
> +    ret = vfu_setup_device_dma(o->vfu_ctx, &dma_register, &dma_unregister);
> +    if (ret < 0) {
> +        error_setg(&error_abort, "vfu: Failed to setup DMA handlers for %s",
> +                   o->devid);
> +        return;
> +    }
> +
>      qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner", vfu_object_ctx_run,
>                         o, QEMU_THREAD_JOINABLE);
>  }
> diff --git a/hw/remote/trace-events b/hw/remote/trace-events
> index 2ef7884..f945c7e 100644
> --- a/hw/remote/trace-events
> +++ b/hw/remote/trace-events
> @@ -7,3 +7,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
>  vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
>  vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
>  vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
> +vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
> +vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
> --
> 1.8.3.1


* RE: [PATCH RFC server 11/11] vfio-user: acceptance test
  2021-07-19 20:00   ` [PATCH RFC server 11/11] vfio-user: acceptance test Jagannathan Raman
@ 2021-07-20 16:12     ` Thanos Makatos
  0 siblings, 0 replies; 55+ messages in thread
From: Thanos Makatos @ 2021-07-20 16:12 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, Swapnil Ingle, John Levon,
	alex.williamson, stefanha



> -----Original Message-----
> From: Jagannathan Raman <jag.raman@oracle.com>
> Sent: 19 July 2021 21:00
> To: qemu-devel@nongnu.org
> Cc: stefanha@redhat.com; alex.williamson@redhat.com;
> elena.ufimtseva@oracle.com; John Levon <john.levon@nutanix.com>;
> john.g.johnson@oracle.com; Thanos Makatos
> <thanos.makatos@nutanix.com>; Swapnil Ingle
> <swapnil.ingle@nutanix.com>; jag.raman@oracle.com
> Subject: [PATCH RFC server 11/11] vfio-user: acceptance test
> 
> Acceptance test for libvfio-user in QEMU
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  MAINTAINERS                   |  1 +
>  tests/acceptance/vfio-user.py | 94 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 95 insertions(+)
>  create mode 100644 tests/acceptance/vfio-user.py
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 46ab6b6..644bd35 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3381,6 +3381,7 @@ F: include/hw/remote/proxy-memory-listener.h
>  F: hw/remote/iohub.c
>  F: include/hw/remote/iohub.h
>  F: hw/remote/vfio-user-obj.c
> +F: tests/acceptance/vfio-user.py
> 
>  EBPF:
>  M: Jason Wang <jasowang@redhat.com>
> diff --git a/tests/acceptance/vfio-user.py b/tests/acceptance/vfio-user.py
> new file mode 100644
> index 0000000..ef318d9
> --- /dev/null
> +++ b/tests/acceptance/vfio-user.py
> @@ -0,0 +1,94 @@
> +# vfio-user protocol sanity test
> +#
> +# This work is licensed under the terms of the GNU GPL, version 2 or
> +# later.  See the COPYING file in the top-level directory.
> +
> +
> +import os
> +import socket
> +import uuid
> +
> +from avocado_qemu import Test
> +from avocado_qemu import wait_for_console_pattern
> +from avocado_qemu import exec_command
> +from avocado_qemu import exec_command_and_wait_for_pattern
> +
> +class VfioUser(Test):
> +    """
> +    :avocado: tags=vfiouser
> +    """
> +    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
> +
> +    def do_test(self, kernel_url, initrd_url, kernel_command_line,
> +                machine_type):
> +        """Main test method"""
> +        self.require_accelerator('kvm')
> +
> +        kernel_path = self.fetch_asset(kernel_url)
> +        initrd_path = self.fetch_asset(initrd_url)
> +
> +        socket = os.path.join('/tmp', str(uuid.uuid4()))
> +        if os.path.exists(socket):
> +            os.remove(socket)
> +
> +        # Create remote process
> +        remote_vm = self.get_vm()
> +        remote_vm.add_args('-machine', 'x-remote')
> +        remote_vm.add_args('-nodefaults')
> +        remote_vm.add_args('-device', 'lsi53c895a,id=lsi1')

IIUC the LSI controller will now be a migratable device, and migration will be
handled by vfu_mig_transition(), introduced in your "vfio-user: register
handlers to facilitate migration" patch. In vfu_mig_transition() you copy
migration data only in the VFU_MIGR_STATE_PRE_COPY case, not in the
VFU_MIGR_STATE_STOP_AND_COPY case. However, I believe that in VFIO it's
possible to jump from the running state straight to the stop-and-copy state.
Are you relying on QEMU not doing this?

> +        remote_vm.add_args('-object', 'vfio-user,id=vfioobj1,'
> +                           'devid=lsi1,socket='+socket)
> +        remote_vm.launch()
> +
> +        # Create proxy process
> +        self.vm.set_console()
> +        self.vm.add_args('-machine', machine_type)
> +        self.vm.add_args('-accel', 'kvm')
> +        self.vm.add_args('-cpu', 'host')
> +        self.vm.add_args('-object',
> +                         'memory-backend-memfd,id=sysmem-file,size=2G')
> +        self.vm.add_args('--numa', 'node,memdev=sysmem-file')
> +        self.vm.add_args('-m', '2048')
> +        self.vm.add_args('-kernel', kernel_path,
> +                         '-initrd', initrd_path,
> +                         '-append', kernel_command_line)
> +        self.vm.add_args('-device',
> +                         'vfio-user-pci,'
> +                         'socket='+socket)
> +        self.vm.launch()
> +        wait_for_console_pattern(self, 'as init process',
> +                                 'Kernel panic - not syncing')
> +        exec_command(self, 'mount -t sysfs sysfs /sys')
> +        exec_command_and_wait_for_pattern(self,
> +                                          'cat /sys/bus/pci/devices/*/uevent',
> +                                          'PCI_ID=1000:0012')
> +
> +    def test_multiprocess_x86_64(self):
> +        """
> +        :avocado: tags=arch:x86_64
> +        """
> +        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
> +                      '/linux/releases/31/Everything/x86_64/os/images'
> +                      '/pxeboot/vmlinuz')
> +        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
> +                      '/linux/releases/31/Everything/x86_64/os/images'
> +                      '/pxeboot/initrd.img')
> +        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
> +                               'console=ttyS0 rdinit=/bin/bash')
> +        machine_type = 'pc'
> +        self.do_test(kernel_url, initrd_url, kernel_command_line, machine_type)
> +
> +    def test_multiprocess_aarch64(self):
> +        """
> +        :avocado: tags=arch:aarch64
> +        """
> +        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
> +                      '/linux/releases/31/Everything/aarch64/os/images'
> +                      '/pxeboot/vmlinuz')
> +        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
> +                      '/linux/releases/31/Everything/aarch64/os/images'
> +                      '/pxeboot/initrd.img')
> +        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
> +                               'rdinit=/bin/bash console=ttyAMA0')
> +        machine_type = 'virt,gic-version=3'
> +        self.do_test(kernel_url, initrd_url, kernel_command_line, machine_type)
> --
> 1.8.3.1


* Re: [PATCH RFC server 04/11] vfio-user: find and init PCI device
  2021-07-19 20:00   ` [PATCH RFC server 04/11] vfio-user: find and init PCI device Jagannathan Raman
@ 2021-07-26 15:05     ` John Levon
  2021-07-28 17:08       ` Jag Raman
  0 siblings, 1 reply; 55+ messages in thread
From: John Levon @ 2021-07-26 15:05 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, swapnil.ingle, qemu-devel,
	alex.williamson, stefanha, thanos.makatos

On Mon, Jul 19, 2021 at 04:00:06PM -0400, Jagannathan Raman wrote:

> +    vfu_pci_set_id(o->vfu_ctx,
> +                   pci_get_word(o->pci_dev->config + PCI_VENDOR_ID),
> +                   pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
> +                   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID),
> +                   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));

Since you handle all config space accesses yourselves, is there even any need
for this?

regards
john


* Re: [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses
  2021-07-19 20:00   ` [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2021-07-26 15:10     ` John Levon
  0 siblings, 0 replies; 55+ messages in thread
From: John Levon @ 2021-07-26 15:10 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, swapnil.ingle, qemu-devel,
	alex.williamson, stefanha, thanos.makatos

On Mon, Jul 19, 2021 at 04:00:08PM -0400, Jagannathan Raman wrote:

> Define and register handlers for PCI config space accesses
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  2 ++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 6a2d0f5..60d9fa8 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -36,6 +36,7 @@
>  #include "sysemu/runstate.h"
>  #include "qemu/notify.h"
>  #include "qemu/thread.h"
> +#include "qemu/main-loop.h"
>  #include "qapi/error.h"
>  #include "sysemu/sysemu.h"
>  #include "hw/qdev-core.h"
> @@ -131,6 +132,35 @@ static void *vfu_object_ctx_run(void *opaque)
>      return NULL;
>  }
>  
> +static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> +                                     size_t count, loff_t offset,
> +                                     const bool is_write)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    uint32_t val = 0;
> +    int i;
> +
> +    qemu_mutex_lock_iothread();
> +
> +    for (i = 0; i < count; i++) {
> +        if (is_write) {
> +            val = *((uint8_t *)(buf + i));
> +            trace_vfu_cfg_write((offset + i), val);
> +            pci_default_write_config(PCI_DEVICE(o->pci_dev),
> +                                     (offset + i), val, 1);
> +        } else {
> +            val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
> +                                          (offset + i), 1);
> +            *((uint8_t *)(buf + i)) = (uint8_t)val;
> +            trace_vfu_cfg_read((offset + i), val);
> +        }
> +    }

Is it always OK to split up the access into single bytes like this?
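
For illustration only, one alternative to per-byte access would be to split a
request into the largest naturally aligned 1-, 2- or 4-byte pieces. This is a
minimal sketch under assumptions: cfg_chunk_len(), cfg_access_fn and
cfg_for_each_chunk() are hypothetical names, not part of the patch or of
libvfio-user, and whether the device model needs wider accesses at all is
exactly the open question here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the largest naturally aligned chunk
 * (4, 2 or 1 bytes) that fits at 'offset' within 'remaining' bytes.
 */
static size_t cfg_chunk_len(uint64_t offset, size_t remaining)
{
    if (remaining >= 4 && (offset & 3) == 0) {
        return 4;
    }
    if (remaining >= 2 && (offset & 1) == 0) {
        return 2;
    }
    return 1;
}

/* Hypothetical callback standing in for a per-access config handler. */
typedef void (*cfg_access_fn)(uint64_t offset, size_t len, void *opaque);

/*
 * Walk [offset, offset + count) in aligned chunks, calling 'fn' (if any)
 * once per chunk. Returns the number of chunks produced.
 */
static size_t cfg_for_each_chunk(uint64_t offset, size_t count,
                                 cfg_access_fn fn, void *opaque)
{
    size_t n = 0;

    while (count > 0) {
        size_t len = cfg_chunk_len(offset, count);

        assert(len >= 1 && len <= count);
        if (fn) {
            fn(offset, len, opaque);
        }
        offset += len;
        count -= len;
        n++;
    }
    return n;
}
```

With this split, an aligned 4-byte access stays a single access, while an
unaligned 4-byte access at offset 1 becomes three accesses of 1, 2 and 1 bytes.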

regards
john


* Re: [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions
  2021-07-19  6:27 ` [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
@ 2021-07-27 16:34   ` Stefan Hajnoczi
  2021-07-28 18:08     ` John Johnson
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Hajnoczi @ 2021-07-27 16:34 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson

On Sun, Jul 18, 2021 at 11:27:42PM -0700, Elena Ufimtseva wrote:
> From: John G Johnson <john.g.johnson@oracle.com>
> 
> Add user.c and user.h files for vfio-user with the basic
> send and receive functions.
> 
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/vfio/user.h                | 120 ++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  hw/vfio/user.c                | 286 ++++++++++++++++++++++++++++++++++
>  MAINTAINERS                   |   4 +
>  hw/vfio/meson.build           |   1 +
>  5 files changed, 413 insertions(+)
>  create mode 100644 hw/vfio/user.h
>  create mode 100644 hw/vfio/user.c

The multi-threading, coroutine, and blocking I/O requirements of
vfio_user_recv() and vfio_user_send_reply() are unclear to me. Please
document them so it's clear what environment they can be called from. I
guess they are not called from coroutines and proxy->ioc is a blocking
IOChannel?

> 
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> new file mode 100644
> index 0000000000..cdbc074579
> --- /dev/null
> +++ b/hw/vfio/user.h
> @@ -0,0 +1,120 @@
> +#ifndef VFIO_USER_H
> +#define VFIO_USER_H
> +
> +/*
> + * vfio protocol over a UNIX socket.
> + *
> + * Copyright © 2018, 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Each message has a standard header that describes the command
> + * being sent, which is almost always a VFIO ioctl().
> + *
> + * The header may be followed by command-specific data, such as the
> + * region and offset info for read and write commands.
> + */
> +
> +/* commands */
> +enum vfio_user_command {
> +    VFIO_USER_VERSION                   = 1,
> +    VFIO_USER_DMA_MAP                   = 2,
> +    VFIO_USER_DMA_UNMAP                 = 3,
> +    VFIO_USER_DEVICE_GET_INFO           = 4,
> +    VFIO_USER_DEVICE_GET_REGION_INFO    = 5,
> +    VFIO_USER_DEVICE_GET_REGION_IO_FDS  = 6,
> +    VFIO_USER_DEVICE_GET_IRQ_INFO       = 7,
> +    VFIO_USER_DEVICE_SET_IRQS           = 8,
> +    VFIO_USER_REGION_READ               = 9,
> +    VFIO_USER_REGION_WRITE              = 10,
> +    VFIO_USER_DMA_READ                  = 11,
> +    VFIO_USER_DMA_WRITE                 = 12,
> +    VFIO_USER_DEVICE_RESET              = 13,
> +    VFIO_USER_DIRTY_PAGES               = 14,
> +    VFIO_USER_MAX,
> +};
> +
> +/* flags */
> +#define VFIO_USER_REQUEST       0x0
> +#define VFIO_USER_REPLY         0x1
> +#define VFIO_USER_TYPE          0xF
> +
> +#define VFIO_USER_NO_REPLY      0x10
> +#define VFIO_USER_ERROR         0x20
> +
> +typedef struct vfio_user_hdr {
> +    uint16_t id;
> +    uint16_t command;
> +    uint32_t size;
> +    uint32_t flags;
> +    uint32_t error_reply;
> +} vfio_user_hdr_t;

Please use QEMU coding style in QEMU code (i.e. not shared with Linux or
external libraries):

  typedef struct {
      ...
  } VfioUserHdr;

You can also specify the struct VfioUserHdr tag if you want, but it's
only necessary if you need to reference the struct before the end of the
typedef definition.

https://qemu-project.gitlab.io/qemu/devel/style.html

> +
> +/*
> + * VFIO_USER_VERSION
> + */
> +#define VFIO_USER_MAJOR_VER     0
> +#define VFIO_USER_MINOR_VER     0
> +
> +struct vfio_user_version {
> +    vfio_user_hdr_t hdr;
> +    uint16_t major;
> +    uint16_t minor;
> +    char capabilities[];
> +};
> +
> +#define VFIO_USER_DEF_MAX_FDS   8
> +#define VFIO_USER_MAX_MAX_FDS   16
> +
> +#define VFIO_USER_DEF_MAX_XFER  (1024 * 1024)
> +#define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)
> +
> +typedef struct VFIOUserFDs {
> +    int send_fds;
> +    int recv_fds;
> +    int *fds;
> +} VFIOUserFDs;

I think around here we switch from vfio-user spec definitions to QEMU
implementation details. It might be nice to keep the vfio-user spec
definitions in a separate header file so the boundary is clear.

> +
> +typedef struct VFIOUserReply {
> +    QTAILQ_ENTRY(VFIOUserReply) next;
> +    vfio_user_hdr_t *msg;
> +    VFIOUserFDs *fds;
> +    int rsize;
> +    uint32_t id;
> +    QemuCond cv;
> +    uint8_t complete;

Please use bool.

> +} VFIOUserReply;
> +
> +enum proxy_state {
> +    CONNECTED = 1,
> +    RECV_ERROR = 2,
> +    CLOSING = 3,
> +    CLOSED = 4,
> +};

These enum values probably need a prefix (VFIO_PROXY_*). Enum constants
live in the global namespace, so generic short names like CONNECTED,
CLOSED, etc. could lead to namespace collisions.
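
A minimal sketch of what prefixed names could look like. The identifiers
VFIOProxyState, VFIO_PROXY_* and vfio_proxy_state_name() are hypothetical and
only illustrate the naming; the numeric values are copied from the patch.

```c
#include <assert.h>

/* Hypothetical prefixed names; the patch uses CONNECTED, RECV_ERROR, ... */
typedef enum {
    VFIO_PROXY_CONNECTED = 1,
    VFIO_PROXY_RECV_ERROR = 2,
    VFIO_PROXY_CLOSING = 3,
    VFIO_PROXY_CLOSED = 4,
} VFIOProxyState;

/* Zero stays unused so a zero-initialized proxy has no valid state yet. */
static_assert(VFIO_PROXY_CONNECTED != 0, "0 must not be a valid state");

/* Hypothetical helper: map a state to a printable name for logging. */
static const char *vfio_proxy_state_name(VFIOProxyState s)
{
    switch (s) {
    case VFIO_PROXY_CONNECTED:
        return "connected";
    case VFIO_PROXY_RECV_ERROR:
        return "recv-error";
    case VFIO_PROXY_CLOSING:
        return "closing";
    case VFIO_PROXY_CLOSED:
        return "closed";
    }
    return "unknown";
}
```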

> +
> +typedef struct VFIOProxy {
> +    QLIST_ENTRY(VFIOProxy) next;
> +    char *sockname;
> +    struct QIOChannel *ioc;
> +    int (*request)(void *opaque, char *buf, VFIOUserFDs *fds);
> +    void *reqarg;
> +    int flags;
> +    QemuCond close_cv;
> +
> +    /*
> +     * above only changed when iolock is held

Please use "BQL" instead of "iolock". git grep shows many results for
BQL and the only result for iolock is in mpqemu code.

> +     * below are protected by per-proxy lock
> +     */
> +    QemuMutex lock;
> +    QTAILQ_HEAD(, VFIOUserReply) free;
> +    QTAILQ_HEAD(, VFIOUserReply) pending;
> +    enum proxy_state state;
> +    int close_wait;

Is this a bool? Please use bool.

> +} VFIOProxy;
> +
> +#define VFIO_PROXY_CLIENT       0x1

A comment that shows which field VFIO_PROXY_CLIENT relates to would make this clearer:

  /* VFIOProxy->flags */
  #define VFIO_PROXY_CLIENT 0x1

> +
> +void vfio_user_recv(void *opaque);
> +void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
> +#endif /* VFIO_USER_H */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 8af11b0a76..f43dc6e5d0 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -75,6 +75,7 @@ typedef struct VFIOAddressSpace {
>  } VFIOAddressSpace;
>  
>  struct VFIOGroup;
> +typedef struct VFIOProxy VFIOProxy;
>  
>  typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
> @@ -143,6 +144,7 @@ typedef struct VFIODevice {
>      VFIOMigration *migration;
>      Error *migration_blocker;
>      OnOffAuto pre_copy_dirty_page_tracking;
> +    VFIOProxy *proxy;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> new file mode 100644
> index 0000000000..021d5540e0
> --- /dev/null
> +++ b/hw/vfio/user.c
> @@ -0,0 +1,286 @@
> +/*
> + * vfio protocol over a UNIX socket.
> + *
> + * Copyright © 2018, 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +#include <sys/ioctl.h>
> +
> +#include "qemu/error-report.h"
> +#include "qapi/error.h"
> +#include "qemu/main-loop.h"
> +#include "hw/hw.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/vfio.h"
> +#include "qemu/sockets.h"
> +#include "io/channel.h"
> +#include "io/channel-util.h"
> +#include "sysemu/iothread.h"
> +#include "user.h"
> +
> +static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
> +static IOThread *vfio_user_iothread;
> +static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
> +                                  VFIOUserFDs *fds);
> +static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
> +                           VFIOUserFDs *fds);
> +static void vfio_user_shutdown(VFIOProxy *proxy);
> +
> +static void vfio_user_shutdown(VFIOProxy *proxy)
> +{
> +    qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
> +    qio_channel_set_aio_fd_handler(proxy->ioc,
> +                                   iothread_get_aio_context(vfio_user_iothread),
> +                                   NULL, NULL, NULL);

There is no other qio_channel_set_aio_fd_handler() call in this patch.
Why is this one necessary?

> +}
> +
> +void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret)
> +{
> +    vfio_user_hdr_t *hdr = (vfio_user_hdr_t *)buf;
> +
> +    /*
> +     * convert header to associated reply
> +     * positive ret is reply size, negative is error code
> +     */
> +    hdr->flags = VFIO_USER_REPLY;
> +    if (ret > 0) {
> +        hdr->size = ret;
> +    } else if (ret < 0) {
> +        hdr->flags |= VFIO_USER_ERROR;
> +        hdr->error_reply = -ret;
> +        hdr->size = sizeof(*hdr);
> +    }

assert(ret != 0)? That case doesn't seem to be defined so maybe an
assertion is worthwhile.

> +    vfio_user_send(proxy, hdr, NULL);
> +}
> +
> +void vfio_user_recv(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOProxy *proxy = vbasedev->proxy;
> +    VFIOUserReply *reply = NULL;
> +    g_autofree int *fdp = NULL;
> +    VFIOUserFDs reqfds = { 0, 0, fdp };
> +    vfio_user_hdr_t msg;
> +    struct iovec iov = {
> +        .iov_base = &msg,
> +        .iov_len = sizeof(msg),
> +    };
> +    int isreply, i, ret;
> +    size_t msgleft, numfds = 0;
> +    char *data = NULL;
> +    g_autofree char *buf = NULL;
> +    Error *local_err = NULL;
> +
> +    qemu_mutex_lock(&proxy->lock);
> +    if (proxy->state == CLOSING) {
> +        qemu_mutex_unlock(&proxy->lock);

QEMU_LOCK_GUARD() automatically unlocks mutexes when the function
returns and is less error-prone than manual lock/unlock calls.
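
To illustrate the general idea of scope-based unlocking (this is not QEMU's
implementation), here is a minimal sketch using plain pthreads and the
GCC/Clang cleanup attribute. SCOPED_LOCK(), scoped_lock(), scoped_unlock(),
proxy_lock, proxy_closing and recv_one() are all hypothetical names made up
for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Lock the mutex and hand it back so it can initialize the guard variable. */
static pthread_mutex_t *scoped_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);

    assert(rc == 0);
    return m;
}

/* Called automatically when the guard variable goes out of scope. */
static void scoped_unlock(pthread_mutex_t **m)
{
    if (*m) {
        pthread_mutex_unlock(*m);
    }
}

/* Hypothetical guard macro; requires GCC or Clang for the cleanup attribute. */
#define SCOPED_LOCK(mutex) \
    pthread_mutex_t *scoped_guard_ \
        __attribute__((cleanup(scoped_unlock), unused)) = scoped_lock(mutex)

static pthread_mutex_t proxy_lock = PTHREAD_MUTEX_INITIALIZER;
static bool proxy_closing;

/* Every return path releases proxy_lock without an explicit unlock call. */
static int recv_one(void)
{
    SCOPED_LOCK(&proxy_lock);

    if (proxy_closing) {
        return -1;
    }
    return 0;
}
```

The early return no longer needs its own unlock call, which is the class of
mistake the guard is meant to rule out.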

> +        return;
> +    }
> +
> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
> +                                 &local_err);
> +    if (ret <= 0) {
> +        /* read error or other side closed connection */
> +        error_setg_errno(&local_err, errno, "vfio_user_recv read error");

This will trigger an assertion failure when local_err was already set by
qio_channel_readv_full():

  static void error_setv(Error **errp,
                         const char *src, int line, const char *func,
                         ErrorClass err_class, const char *fmt, va_list ap,
                         const char *suffix)
  {
      Error *err;
      int saved_errno = errno;
  
      if (errp == NULL) {
          return;
      }
      assert(*errp == NULL);
      ^^^^^^^^^^^^^^^^^^^^^^

I think this error_setg_errno() call should be dropped. You can use
error_prepend() if you'd like to add more information to the error
message from qio_channel_readv_full().

> +        goto fatal;
> +    }
> +
> +    if (ret < sizeof(msg)) {
> +        error_setg(&local_err, "vfio_user_recv short read of header");
> +        goto err;
> +    }
> +
> +    /*
> +     * For replies, find the matching pending request
> +     */
> +    switch (msg.flags & VFIO_USER_TYPE) {
> +    case VFIO_USER_REQUEST:
> +        isreply = 0;
> +        break;
> +    case VFIO_USER_REPLY:
> +        isreply = 1;
> +        break;
> +    default:
> +        error_setg(&local_err, "vfio_user_recv unknown message type");
> +        goto err;
> +    }
> +
> +    if (isreply) {
> +        QTAILQ_FOREACH(reply, &proxy->pending, next) {
> +            if (msg.id == reply->id) {
> +                break;
> +            }
> +        }

I'm surprised to see this loop since proxy->lock prevents additional
requests from being sent while we're trying to receive a message. Can
there really be multiple replies pending with this locking scheme?

> +        if (reply == NULL) {
> +            error_setg(&local_err, "vfio_user_recv unexpected reply");
> +            goto err;
> +        }
> +        QTAILQ_REMOVE(&proxy->pending, reply, next);
> +
> +        /*
> +         * Process any received FDs
> +         */
> +        if (numfds != 0) {
> +            if (reply->fds == NULL || reply->fds->recv_fds < numfds) {
> +                error_setg(&local_err, "vfio_user_recv unexpected FDs");
> +                goto err;
> +            }
> +            reply->fds->recv_fds = numfds;
> +            memcpy(reply->fds->fds, fdp, numfds * sizeof(int));
> +        }
> +
> +    } else {
> +        /*
> +         * The client doesn't expect any FDs in requests, but
> +         * they will be expected on the server
> +         */
> +        if (numfds != 0 && (proxy->flags & VFIO_PROXY_CLIENT)) {
> +            error_setg(&local_err, "vfio_user_recv fd in client reply");
> +            goto err;
> +        }
> +        reqfds.recv_fds = numfds;
> +    }
> +
> +    /*
> +     * put the whole message into a single buffer
> +     */
> +    msgleft = msg.size - sizeof(msg);

msg.size has not been validated so this could underflow. Please validate
all inputs so malicious servers/clients cannot crash or compromise the
program.
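
As a sketch of the kind of validation meant here, the helper below rejects a
size smaller than the header or larger than a caller-supplied limit before
computing the body length. The struct layout is copied from the patch;
VfioUserHdr and vfio_user_body_len() are hypothetical names used only for
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as vfio_user_hdr_t in the patch. */
typedef struct {
    uint16_t id;
    uint16_t command;
    uint32_t size;
    uint32_t flags;
    uint32_t error_reply;
} VfioUserHdr;

/*
 * Hypothetical helper: validate the untrusted hdr->size and, only if it is
 * sane, store the number of bytes that follow the header in *body_len.
 * Returns false (leaving *body_len untouched) when the size is out of range.
 */
static bool vfio_user_body_len(const VfioUserHdr *hdr, uint32_t max_size,
                               size_t *body_len)
{
    assert(hdr != NULL && body_len != NULL);

    if (hdr->size < sizeof(*hdr) || hdr->size > max_size) {
        return false;
    }
    *body_len = hdr->size - sizeof(*hdr);
    return true;
}
```

Checking the lower bound first means the subtraction can never wrap around to
a huge unsigned value.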

> +    if (isreply) {
> +        if (msg.size > reply->rsize) {

rsize is an int. Should it be uint32_t like msg.size?

> +            error_setg(&local_err,
> +                       "vfio_user_recv reply larger than recv buffer");
> +            goto fatal;
> +        }
> +        *reply->msg = msg;
> +        data = (char *)reply->msg + sizeof(msg);
> +    } else {
> +        if (msg.size > max_xfer_size) {
> +            error_setg(&local_err, "vfio_user_recv request larger than max");
> +            goto fatal;
> +        }

Missing check to prevent buffer overflow:

  if (msg.size < sizeof(msg)) {
      error_setg(&local_err, "vfio_user_recv request too small");
      goto fatal;
  }

> +        buf = g_malloc0(msg.size);
> +        memcpy(buf, &msg, sizeof(msg));
> +        data = buf + sizeof(msg);
> +    }
> +
> +    if (msgleft != 0) {
> +        ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
> +        if (ret < 0) {
> +            goto fatal;
> +        }
> +        if (ret != msgleft) {
> +            error_setg(&local_err, "vfio_user_recv short read of msg body");
> +            goto err;
> +        }
> +    }
> +
> +    /*
> +     * Replies signal a waiter, requests get processed by vfio code
> +     * that may assume the iothread lock is held.
> +     */
> +    qemu_mutex_unlock(&proxy->lock);
> +    if (isreply) {
> +        reply->complete = 1;
> +        qemu_cond_signal(&reply->cv);

qemu_cond_signal() must be called with the mutex held to avoid race
conditions. If the waiter has acquired the lock and still sees
complete == 0, and we then signal before the waiter enters the wait, the
signal is missed and the waiter is stuck.
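
The usual pattern, sketched here with plain pthreads rather than QEMU's
wrappers, is to update the flag and signal while holding the same mutex the
waiter uses. Reply, reply_init(), reply_wait(), reply_complete() and
completer_thread() are hypothetical names for this example only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    bool complete;
} Reply;

static void reply_init(Reply *r)
{
    int rc = pthread_mutex_init(&r->lock, NULL);

    assert(rc == 0);
    rc = pthread_cond_init(&r->cv, NULL);
    assert(rc == 0);
    r->complete = false;
}

/* Waiter: the flag is re-checked under the lock on every wakeup. */
static void reply_wait(Reply *r)
{
    pthread_mutex_lock(&r->lock);
    while (!r->complete) {
        pthread_cond_wait(&r->cv, &r->lock);
    }
    pthread_mutex_unlock(&r->lock);
}

/*
 * Completer: set the flag and signal with the lock held, so the waiter
 * either sees complete == true or is already blocked in the wait.
 */
static void reply_complete(Reply *r)
{
    pthread_mutex_lock(&r->lock);
    r->complete = true;
    pthread_cond_signal(&r->cv);
    pthread_mutex_unlock(&r->lock);
}

/* Thread entry point that completes the reply passed as 'opaque'. */
static void *completer_thread(void *opaque)
{
    reply_complete(opaque);
    return NULL;
}
```

Because pthread_cond_wait() releases the mutex atomically with blocking, there
is no window between the waiter's check and its wait in which the completer
can run.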

> +    } else {
> +        qemu_mutex_lock_iothread();
> +        /*
> +         * make sure proxy wasn't closed while we waited
> +         * checking without holding the proxy lock is safe
> +         * since state is only set to CLOSING when iolock is held

s/iolock/the BQL/

> +         */
> +        if (proxy->state != CLOSING) {
> +            ret = proxy->request(proxy->reqarg, buf, &reqfds);
> +            if (ret < 0 && !(msg.flags & VFIO_USER_NO_REPLY)) {
> +                vfio_user_send_reply(proxy, buf, ret);
> +            }
> +        }
> +        qemu_mutex_unlock_iothread();
> +    }
> +
> +    return;
> + fatal:
> +    vfio_user_shutdown(proxy);
> +    proxy->state = RECV_ERROR;
> +
> + err:
> +    qemu_mutex_unlock(&proxy->lock);
> +    for (i = 0; i < numfds; i++) {
> +        close(fdp[i]);
> +    }
> +    if (reply != NULL) {
> +        /* force an error to keep sending thread from hanging */
> +        reply->msg->flags |= VFIO_USER_ERROR;
> +        reply->msg->error_reply = EINVAL;
> +        reply->complete = 1;
> +        qemu_cond_signal(&reply->cv);

This has the race condition too.

> +    }
> +    error_report_err(local_err);
> +}
> +
> +static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
> +                                  VFIOUserFDs *fds)
> +{
> +    struct iovec iov = {
> +        .iov_base = msg,
> +        .iov_len = msg->size,
> +    };
> +    size_t numfds = 0;
> +    int msgleft, ret, *fdp = NULL;
> +    char *buf;
> +    Error *local_err = NULL;
> +
> +    if (proxy->state != CONNECTED) {
> +        msg->flags |= VFIO_USER_ERROR;
> +        msg->error_reply = ECONNRESET;
> +        return;
> +    }
> +
> +    if (fds != NULL && fds->send_fds != 0) {
> +        numfds = fds->send_fds;
> +        fdp = fds->fds;
> +    }
> +    ret = qio_channel_writev_full(proxy->ioc, &iov, 1, fdp, numfds, &local_err);
> +    if (ret < 0) {
> +        goto err;
> +    }
> +    if (ret == msg->size) {
> +        return;
> +    }
> +
> +    buf = iov.iov_base + ret;
> +    msgleft = iov.iov_len - ret;
> +    do {
> +        ret = qio_channel_write(proxy->ioc, buf, msgleft, &local_err);
> +        if (ret < 0) {
> +            goto err;
> +        }
> +        buf += ret, msgleft -= ret;

Please use a semicolon. Comma operators are rare, requiring readers to
check their exact semantics. There is no need to use a comma here.

> +    } while (msgleft != 0);
> +    return;
> +
> + err:
> +    error_report_err(local_err);

State remains unchanged and msg->error_reply isn't set?

> +}
> +
> +static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
> +                           VFIOUserFDs *fds)
> +{
> +    bool iolock = qemu_mutex_iothread_locked();
> +
> +    if (iolock) {
> +        qemu_mutex_unlock_iothread();
> +    }
> +    qemu_mutex_lock(&proxy->lock);
> +    vfio_user_send_locked(proxy, msg, fds);
> +    qemu_mutex_unlock(&proxy->lock);
> +    if (iolock) {
> +        qemu_mutex_lock_iothread();
> +    }
> +}
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 12d69f3a45..aa4df6c418 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1883,8 +1883,12 @@ L: qemu-s390x@nongnu.org
>  vfio-user
>  M: John G Johnson <john.g.johnson@oracle.com>
>  M: Thanos Makatos <thanos.makatos@nutanix.com>
> +M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> +M: Jagannathan Raman <jag.raman@oracle.com>
>  S: Supported
>  F: docs/devel/vfio-user.rst
> +F: hw/vfio/user.c
> +F: hw/vfio/user.h
>  
>  vhost
>  M: Michael S. Tsirkin <mst@redhat.com>
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af297a0..739b30be73 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -8,6 +8,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> +  'user.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
>  vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))
> -- 
> 2.25.1
> 


* Re: [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info
  2021-07-19  6:27 ` [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
@ 2021-07-28 10:16   ` Stefan Hajnoczi
  2021-07-29  0:55     ` John Johnson
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Hajnoczi @ 2021-07-28 10:16 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson

On Sun, Jul 18, 2021 at 11:27:43PM -0700, Elena Ufimtseva wrote:
> From: John G Johnson <john.g.johnson@oracle.com>
> 
> New class for vfio-user with its class and instance
> constructors and destructors.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/vfio/pci.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bea95efc33..554b562769 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -42,6 +42,7 @@
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
>  #include "migration/qemu-file.h"
> +#include "hw/vfio/user.h"
>  
>  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>  
> @@ -3326,3 +3327,51 @@ static void register_vfio_pci_dev_type(void)
>  }
>  
>  type_init(register_vfio_pci_dev_type)
> +
> +static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
> +{
> +    ERRP_GUARD();
> +    VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
> +
> +    if (!udev->sock_name) {
> +        error_setg(errp, "No socket specified");
> +        error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
> +        return;
> +    }
> +}
> +
> +static void vfio_user_instance_finalize(Object *obj)
> +{
> +}
> +
> +static Property vfio_user_pci_dev_properties[] = {
> +    DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),

Please use SocketAddress so that alternative socket connection details
can be supported without inventing custom syntax for vfio-user-pci. For
example, file descriptor passing should be possible.

I think this requires a bit of command-line parsing work, so don't worry
about it for now, but please add a TODO comment. When the -device
vfio-user-pci syntax is finalized (i.e. when the code is merged and the
device name doesn't start with the experimental x- prefix), then it
needs to be solved.

> +    DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure, false),

I'm not sure what "secure-dma" means and the "secure" variable name is
even more inscrutable. Does this mean don't share memory so that each
DMA access is checked individually?

> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
> +
> +    device_class_set_props(dc, vfio_user_pci_dev_properties);
> +    dc->desc = "VFIO over socket PCI device assignment";
> +    pdc->realize = vfio_user_pci_realize;
> +}
> +
> +static const TypeInfo vfio_user_pci_dev_info = {
> +    .name = TYPE_VFIO_USER_PCI,
> +    .parent = TYPE_VFIO_PCI_BASE,
> +    .instance_size = sizeof(VFIOUserPCIDevice),
> +    .class_init = vfio_user_pci_dev_class_init,
> +    .instance_init = vfio_instance_init,
> +    .instance_finalize = vfio_user_instance_finalize,
> +};
> +
> +static void register_vfio_user_dev_type(void)
> +{
> +    type_register_static(&vfio_user_pci_dev_info);
> +}
> +
> +type_init(register_vfio_user_dev_type)
> -- 
> 2.25.1
> 


* Re: [PATCH RFC server 04/11] vfio-user: find and init PCI device
  2021-07-26 15:05     ` John Levon
@ 2021-07-28 17:08       ` Jag Raman
  0 siblings, 0 replies; 55+ messages in thread
From: Jag Raman @ 2021-07-28 17:08 UTC (permalink / raw)
  To: John Levon
  Cc: Elena Ufimtseva, John Johnson, swapnil.ingle, qemu-devel,
	alex.williamson, stefanha, thanos.makatos



> On Jul 26, 2021, at 11:05 AM, John Levon <levon@movementarian.org> wrote:
> 
> On Mon, Jul 19, 2021 at 04:00:06PM -0400, Jagannathan Raman wrote:
> 
>> +    vfu_pci_set_id(o->vfu_ctx,
>> +                   pci_get_word(o->pci_dev->config + PCI_VENDOR_ID),
>> +                   pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
>> +                   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID),
>> +                   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));
> 
> Since you handle all config space accesses yourselves, is there even any need
> for this?

I think that makes sense. Since the QEMU server handles all the config space
accesses, it’s not necessary to register the device’s vendor/device ID with the library.

Thank you!
--
Jag

> 
> regards
> john


* Re: [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions
  2021-07-27 16:34   ` Stefan Hajnoczi
@ 2021-07-28 18:08     ` John Johnson
  2021-07-29  8:06       ` Stefan Hajnoczi
  0 siblings, 1 reply; 55+ messages in thread
From: John Johnson @ 2021-07-28 18:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson



> On Jul 27, 2021, at 9:34 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sun, Jul 18, 2021 at 11:27:42PM -0700, Elena Ufimtseva wrote:
>> From: John G Johnson <john.g.johnson@oracle.com>
>> 
>> Add user.c and user.h files for vfio-user with the basic
>> send and receive functions.
>> 
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/vfio/user.h                | 120 ++++++++++++++
>> include/hw/vfio/vfio-common.h |   2 +
>> hw/vfio/user.c                | 286 ++++++++++++++++++++++++++++++++++
>> MAINTAINERS                   |   4 +
>> hw/vfio/meson.build           |   1 +
>> 5 files changed, 413 insertions(+)
>> create mode 100644 hw/vfio/user.h
>> create mode 100644 hw/vfio/user.c
> 
> The multi-threading, coroutine, and blocking I/O requirements of
> vfio_user_recv() and vfio_user_send_reply() are unclear to me. Please
> document them so it's clear what environment they can be called from. I
> guess they are not called from coroutines and proxy->ioc is a blocking
> IOChannel?
> 

	Yes to both.  A block comment above vfio_user_recv() would also be
useful.  The call that sets up vfio_user_recv() as the socket handler isn’t
in this patch; do you want the series reorganized?



>> 
>> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
>> new file mode 100644
>> index 0000000000..cdbc074579
>> --- /dev/null
>> +++ b/hw/vfio/user.h
>> @@ -0,0 +1,120 @@
>> +#ifndef VFIO_USER_H
>> +#define VFIO_USER_H
>> +
>> +/*
>> + * vfio protocol over a UNIX socket.
>> + *
>> + * Copyright © 2018, 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Each message has a standard header that describes the command
>> + * being sent, which is almost always a VFIO ioctl().
>> + *
>> + * The header may be followed by command-specific data, such as the
>> + * region and offset info for read and write commands.
>> + */
>> +
>> +/* commands */
>> +enum vfio_user_command {
>> +    VFIO_USER_VERSION                   = 1,
>> +    VFIO_USER_DMA_MAP                   = 2,
>> +    VFIO_USER_DMA_UNMAP                 = 3,
>> +    VFIO_USER_DEVICE_GET_INFO           = 4,
>> +    VFIO_USER_DEVICE_GET_REGION_INFO    = 5,
>> +    VFIO_USER_DEVICE_GET_REGION_IO_FDS  = 6,
>> +    VFIO_USER_DEVICE_GET_IRQ_INFO       = 7,
>> +    VFIO_USER_DEVICE_SET_IRQS           = 8,
>> +    VFIO_USER_REGION_READ               = 9,
>> +    VFIO_USER_REGION_WRITE              = 10,
>> +    VFIO_USER_DMA_READ                  = 11,
>> +    VFIO_USER_DMA_WRITE                 = 12,
>> +    VFIO_USER_DEVICE_RESET              = 13,
>> +    VFIO_USER_DIRTY_PAGES               = 14,
>> +    VFIO_USER_MAX,
>> +};
>> +
>> +/* flags */
>> +#define VFIO_USER_REQUEST       0x0
>> +#define VFIO_USER_REPLY         0x1
>> +#define VFIO_USER_TYPE          0xF
>> +
>> +#define VFIO_USER_NO_REPLY      0x10
>> +#define VFIO_USER_ERROR         0x20
>> +
>> +typedef struct vfio_user_hdr {
>> +    uint16_t id;
>> +    uint16_t command;
>> +    uint32_t size;
>> +    uint32_t flags;
>> +    uint32_t error_reply;
>> +} vfio_user_hdr_t;
> 
> Please use QEMU coding style in QEMU code (i.e. not shared with Linux or
> external libraries):
> 
>  typedef struct {
>      ...
>  } VfioUserHdr;
> 
> You can also specify the struct VfioUserHdr tag if you want but it's
> only necessary to reference the struct before the end of the typedef
> definition.
> 
> https://qemu-project.gitlab.io/qemu/devel/style.html
> 

	OK

>> +
>> +/*
>> + * VFIO_USER_VERSION
>> + */
>> +#define VFIO_USER_MAJOR_VER     0
>> +#define VFIO_USER_MINOR_VER     0
>> +
>> +struct vfio_user_version {
>> +    vfio_user_hdr_t hdr;
>> +    uint16_t major;
>> +    uint16_t minor;
>> +    char capabilities[];
>> +};
>> +
>> +#define VFIO_USER_DEF_MAX_FDS   8
>> +#define VFIO_USER_MAX_MAX_FDS   16
>> +
>> +#define VFIO_USER_DEF_MAX_XFER  (1024 * 1024)
>> +#define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)
>> +
>> +typedef struct VFIOUserFDs {
>> +    int send_fds;
>> +    int recv_fds;
>> +    int *fds;
>> +} VFIOUserFDs;
> 
> I think around here we switch from vfio-user spec definitions to QEMU
> implementation details. It might be nice to keep the vfio-user spec
> definitions in a separate header file so the boundary is clear.
> 

	OK


>> +
>> +typedef struct VFIOUserReply {
>> +    QTAILQ_ENTRY(VFIOUserReply) next;
>> +    vfio_user_hdr_t *msg;
>> +    VFIOUserFDs *fds;
>> +    int rsize;
>> +    uint32_t id;
>> +    QemuCond cv;
>> +    uint8_t complete;
> 
> Please use bool.
> 

	OK

>> +} VFIOUserReply;
>> +
>> +enum proxy_state {
>> +    CONNECTED = 1,
>> +    RECV_ERROR = 2,
>> +    CLOSING = 3,
>> +    CLOSED = 4,
>> +};
> 
> These enum values probably need a prefix (VFIO_PROXY_*). Generic short
> names like CONNECTED, CLOSED, etc could lead to namespace collisions.
> Enum constants are in the global namespace.
> 

	OK
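
For reference, a sketch of the renamed constants, assuming the VFIO_PROXY_
prefix suggested above; the final names are not settled in this thread:

```c
/* Prefixed so the constants cannot collide with other global identifiers. */
enum proxy_state {
    VFIO_PROXY_CONNECTED = 1,
    VFIO_PROXY_RECV_ERROR = 2,
    VFIO_PROXY_CLOSING = 3,
    VFIO_PROXY_CLOSED = 4,
};
```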


>> +
>> +typedef struct VFIOProxy {
>> +    QLIST_ENTRY(VFIOProxy) next;
>> +    char *sockname;
>> +    struct QIOChannel *ioc;
>> +    int (*request)(void *opaque, char *buf, VFIOUserFDs *fds);
>> +    void *reqarg;
>> +    int flags;
>> +    QemuCond close_cv;
>> +
>> +    /*
>> +     * above only changed when iolock is held
> 
> Please use "BQL" instead of "iolock". git grep shows many results for
> BQL and the only result for iolock is in mpqemu code.
> 

	OK

>> +     * below are protected by per-proxy lock
>> +     */
>> +    QemuMutex lock;
>> +    QTAILQ_HEAD(, VFIOUserReply) free;
>> +    QTAILQ_HEAD(, VFIOUserReply) pending;
>> +    enum proxy_state state;
>> +    int close_wait;
> 
> Is this a bool? Please use bool.

	yes it’s a bool


> 
>> +} VFIOProxy;
>> +
>> +#define VFIO_PROXY_CLIENT       0x1
> 
> A comment that shows which field VFIO_PROXY_CLIENT relates to would make this clearer:
> 
>  /* VFIOProxy->flags */
>  #define VFIO_PROXY_CLIENT 0x1
> 

	OK

>> +
>> +void vfio_user_recv(void *opaque);
>> +void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
>> +#endif /* VFIO_USER_H */
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 8af11b0a76..f43dc6e5d0 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -75,6 +75,7 @@ typedef struct VFIOAddressSpace {
>> } VFIOAddressSpace;
>> 
>> struct VFIOGroup;
>> +typedef struct VFIOProxy VFIOProxy;
>> 
>> typedef struct VFIOContainer {
>>     VFIOAddressSpace *space;
>> @@ -143,6 +144,7 @@ typedef struct VFIODevice {
>>     VFIOMigration *migration;
>>     Error *migration_blocker;
>>     OnOffAuto pre_copy_dirty_page_tracking;
>> +    VFIOProxy *proxy;
>> } VFIODevice;
>> 
>> struct VFIODeviceOps {
>> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
>> new file mode 100644
>> index 0000000000..021d5540e0
>> --- /dev/null
>> +++ b/hw/vfio/user.c
>> @@ -0,0 +1,286 @@
>> +/*
>> + * vfio protocol over a UNIX socket.
>> + *
>> + * Copyright © 2018, 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>> +
>> +#include "qemu/error-report.h"
>> +#include "qapi/error.h"
>> +#include "qemu/main-loop.h"
>> +#include "hw/hw.h"
>> +#include "hw/vfio/vfio-common.h"
>> +#include "hw/vfio/vfio.h"
>> +#include "qemu/sockets.h"
>> +#include "io/channel.h"
>> +#include "io/channel-util.h"
>> +#include "sysemu/iothread.h"
>> +#include "user.h"
>> +
>> +static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
>> +static IOThread *vfio_user_iothread;
>> +static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
>> +                                  VFIOUserFDs *fds);
>> +static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
>> +                           VFIOUserFDs *fds);
>> +static void vfio_user_shutdown(VFIOProxy *proxy);
>> +
>> +static void vfio_user_shutdown(VFIOProxy *proxy)
>> +{
>> +    qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
>> +    qio_channel_set_aio_fd_handler(proxy->ioc,
>> +                                   iothread_get_aio_context(vfio_user_iothread),
>> +                                   NULL, NULL, NULL);
> 
> There is no other qio_channel_set_aio_fd_handler() call in this patch.
> Why is this one necessary?
> 

	See first comment.

>> +}
>> +
>> +void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret)
>> +{
>> +    vfio_user_hdr_t *hdr = (vfio_user_hdr_t *)buf;
>> +
>> +    /*
>> +     * convert header to associated reply
>> +     * positive ret is reply size, negative is error code
>> +     */
>> +    hdr->flags = VFIO_USER_REPLY;
>> +    if (ret > 0) {
>> +        hdr->size = ret;
>> +    } else if (ret < 0) {
>> +        hdr->flags |= VFIO_USER_ERROR;
>> +        hdr->error_reply = -ret;
>> +        hdr->size = sizeof(*hdr);
>> +    }
> 
> assert(ret != 0)? That case doesn't seem to be defined so maybe an
> assertion is worthwhile.
> 

	I should also treat a positive size smaller than the header size as
an error.
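
A sketch of that handling, with hypothetical names mirroring the patch's
header layout.  Mapping both zero and a too-small positive size to EINVAL is
an assumption; the thread does not say which error code to use:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLY 0x1    /* same values as the flags in the patch */
#define SKETCH_ERROR 0x20

struct sketch_hdr {
    uint16_t id;
    uint16_t command;
    uint32_t size;
    uint32_t flags;
    uint32_t error_reply;
};

/*
 * Convert a request header into a reply header.
 * ret >= sizeof(*hdr) is the reply size; ret < 0 is a negated errno.
 * Anything in between, including 0, is reported as EINVAL (assumption).
 */
static void sketch_make_reply(struct sketch_hdr *hdr, int ret)
{
    hdr->flags = SKETCH_REPLY;
    if (ret >= 0 && (size_t)ret < sizeof(*hdr)) {
        ret = -EINVAL;
    }
    if (ret < 0) {
        hdr->flags |= SKETCH_ERROR;
        hdr->error_reply = (uint32_t)(-ret);
        hdr->size = sizeof(*hdr);
    } else {
        hdr->size = (uint32_t)ret;
    }
}
```

This also covers the ret == 0 case raised above without needing an assert.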


>> +    vfio_user_send(proxy, hdr, NULL);
>> +}
>> +
>> +void vfio_user_recv(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOProxy *proxy = vbasedev->proxy;
>> +    VFIOUserReply *reply = NULL;
>> +    g_autofree int *fdp = NULL;
>> +    VFIOUserFDs reqfds = { 0, 0, fdp };
>> +    vfio_user_hdr_t msg;
>> +    struct iovec iov = {
>> +        .iov_base = &msg,
>> +        .iov_len = sizeof(msg),
>> +    };
>> +    int isreply, i, ret;
>> +    size_t msgleft, numfds = 0;
>> +    char *data = NULL;
>> +    g_autofree char *buf = NULL;
>> +    Error *local_err = NULL;
>> +
>> +    qemu_mutex_lock(&proxy->lock);
>> +    if (proxy->state == CLOSING) {
>> +        qemu_mutex_unlock(&proxy->lock);
> 
> QEMU_LOCK_GUARD() automatically unlocks mutexes when the function
> returns and is less error-prone than manual lock/unlock calls.
> 

	will look into it
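
To illustrate the scope-guard idea, here is a sketch built on the GCC/Clang
cleanup attribute and a pthread mutex.  This is not QEMU's macro; every
sketch_ name is hypothetical, and the guard can be used once per scope:

```c
#include <pthread.h>

/* Called automatically when the guard variable goes out of scope. */
static void sketch_unlock(pthread_mutex_t **m)
{
    if (*m) {
        pthread_mutex_unlock(*m);
    }
}

/* Lock the mutex and hand it back so it can initialize the guard. */
static pthread_mutex_t *sketch_lock_acquire(pthread_mutex_t *m)
{
    pthread_mutex_lock(m);
    return m;
}

/* Lock now; unlock on every exit path of the enclosing scope. */
#define SKETCH_LOCK_GUARD(m) \
    __attribute__((cleanup(sketch_unlock), unused)) \
    pthread_mutex_t *sketch_guard_ = sketch_lock_acquire(m)

static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static int sketch_closing;

/* The early return no longer needs its own unlock call. */
static int sketch_recv(void)
{
    SKETCH_LOCK_GUARD(&sketch_lock);

    if (sketch_closing) {
        return -1;
    }
    return 0;
}
```

Both return paths of sketch_recv() leave the mutex unlocked, which is the
property the manual lock/unlock pairs have to maintain by hand.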


>> +        return;
>> +    }
>> +
>> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
>> +                                 &local_err);
>> +    if (ret <= 0) {
>> +        /* read error or other side closed connection */
>> +        error_setg_errno(&local_err, errno, "vfio_user_recv read error");
> 
> This will trigger an assertion failure when local_err was already set by
> qio_channel_readv_full():
> 
>  static void error_setv(Error **errp,
>                         const char *src, int line, const char *func,
>                         ErrorClass err_class, const char *fmt, va_list ap,
>                         const char *suffix)
>  {
>      Error *err;
>      int saved_errno = errno;
> 
>      if (errp == NULL) {
>          return;
>      }
>      assert(*errp == NULL);
>      ^^^^^^^^^^^^^^^^^^^^^^
> 
> I think this error_setg_errno() call should be dropped. You can use
> error_prepend() if you'd like to add more information to the error
> message from qio_channel_readv_full().
> 

	OK


>> +        goto fatal;
>> +    }
>> +
>> +    if (ret < sizeof(msg)) {
>> +        error_setg(&local_err, "vfio_user_recv short read of header");
>> +        goto err;
>> +    }
>> +
>> +    /*
>> +     * For replies, find the matching pending request
>> +     */
>> +    switch (msg.flags & VFIO_USER_TYPE) {
>> +    case VFIO_USER_REQUEST:
>> +        isreply = 0;
>> +        break;
>> +    case VFIO_USER_REPLY:
>> +        isreply = 1;
>> +        break;
>> +    default:
>> +        error_setg(&local_err, "vfio_user_recv unknown message type");
>> +        goto err;
>> +    }
>> +
>> +    if (isreply) {
>> +        QTAILQ_FOREACH(reply, &proxy->pending, next) {
>> +            if (msg.id == reply->id) {
>> +                break;
>> +            }
>> +        }
> 
> I'm surprised to see this loop since proxy->lock prevents additional
> requests from being sent while we're trying to receive a message. Can
> there really be multiple replies pending with this locking scheme?
> 

	I didn’t want to assume that was always true.  See the email
exchange with Peter Xu: I can drop the BQL in the middle of a memory
region transaction that causes dma_map/unmap messages to be sent.  The
fix for that issue will be to send the messages async, then wait for the
youngest reply when the transaction commits.



>> +        if (reply == NULL) {
>> +            error_setg(&local_err, "vfio_user_recv unexpected reply");
>> +            goto err;
>> +        }
>> +        QTAILQ_REMOVE(&proxy->pending, reply, next);
>> +
>> +        /*
>> +         * Process any received FDs
>> +         */
>> +        if (numfds != 0) {
>> +            if (reply->fds == NULL || reply->fds->recv_fds < numfds) {
>> +                error_setg(&local_err, "vfio_user_recv unexpected FDs");
>> +                goto err;
>> +            }
>> +            reply->fds->recv_fds = numfds;
>> +            memcpy(reply->fds->fds, fdp, numfds * sizeof(int));
>> +        }
>> +
>> +    } else {
>> +        /*
>> +         * The client doesn't expect any FDs in requests, but
>> +         * they will be expected on the server
>> +         */
>> +        if (numfds != 0 && (proxy->flags & VFIO_PROXY_CLIENT)) {
>> +            error_setg(&local_err, "vfio_user_recv fd in client reply");
>> +            goto err;
>> +        }
>> +        reqfds.recv_fds = numfds;
>> +    }
>> +
>> +    /*
>> +     * put the whole message into a single buffer
>> +     */
>> +    msgleft = msg.size - sizeof(msg);
> 
> msg.size has not been validated so this could underflow. Please validate
> all inputs so malicious servers/clients cannot crash or compromise the
> program.
> 

	OK

>> +    if (isreply) {
>> +        if (msg.size > reply->rsize) {
> 
> rsize is an int. Should it be uint32_t like msg.size?
> 

	OK

>> +            error_setg(&local_err,
>> +                       "vfio_user_recv reply larger than recv buffer");
>> +            goto fatal;
>> +        }
>> +        *reply->msg = msg;
>> +        data = (char *)reply->msg + sizeof(msg);
>> +    } else {
>> +        if (msg.size > max_xfer_size) {
>> +            error_setg(&local_err, "vfio_user_recv request larger than max");
>> +            goto fatal;
>> +        }
> 
> Missing check to prevent buffer overflow:
> 
>  if (msg.size < sizeof(msg)) {
>      error_setg(&local_err, "vfio_user_recv request too small");
>      goto fatal;
>  }
> 

	I will move this check up, ahead of the msgleft calculation flagged
in the review comment above.
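
A sketch of validating the untrusted size field before the subtraction.
The helper name and its parameters are hypothetical; the header size and the
maximum are passed in so the example stays independent of the real structs:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check an untrusted message size and compute the body length.
 * Returns false if the size is smaller than the header or above the limit.
 */
static bool sketch_body_len(uint32_t msg_size, size_t hdr_size,
                            size_t max_size, size_t *body_len)
{
    if (msg_size < hdr_size || msg_size > max_size) {
        return false;
    }
    /* Cannot wrap around: msg_size >= hdr_size was checked above. */
    *body_len = msg_size - hdr_size;
    return true;
}
```

Doing both bounds checks in one place, before any buffer is allocated or
any length is derived, keeps the request and reply paths consistent.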


>> +        buf = g_malloc0(msg.size);
>> +        memcpy(buf, &msg, sizeof(msg));
>> +        data = buf + sizeof(msg);
>> +    }
>> +
>> +    if (msgleft != 0) {
>> +        ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
>> +        if (ret < 0) {
>> +            goto fatal;
>> +        }
>> +        if (ret != msgleft) {
>> +            error_setg(&local_err, "vfio_user_recv short read of msg body");
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Replies signal a waiter, requests get processed by vfio code
>> +     * that may assume the iothread lock is held.
>> +     */
>> +    qemu_mutex_unlock(&proxy->lock);
>> +    if (isreply) {
>> +        reply->complete = 1;
>> +        qemu_cond_signal(&reply->cv);
> 
> signal must be called with the mutex held to avoid race conditions. If
> the waiter acquires the lock and still sees complete == 0, then we
> signal before wait is entered, the signal is missed and the waiter is
> stuck.
> 

	Yes, this is a bug
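
A sketch of the corrected pattern using plain pthreads rather than QEMU's
wrappers.  The struct and function names are hypothetical, and the sketch
keeps the mutex inside the reply for brevity, whereas the patch uses the
per-proxy lock:

```c
#include <pthread.h>
#include <stdbool.h>

struct sketch_reply {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    bool complete;
};

/* Receiver side: set the flag and signal while still holding the lock. */
static void sketch_complete(struct sketch_reply *r)
{
    pthread_mutex_lock(&r->lock);
    r->complete = true;
    pthread_cond_signal(&r->cv);
    pthread_mutex_unlock(&r->lock);
}

/* Sender side: re-check the flag in a loop under the same lock. */
static void sketch_wait(struct sketch_reply *r)
{
    pthread_mutex_lock(&r->lock);
    while (!r->complete) {
        pthread_cond_wait(&r->cv, &r->lock);
    }
    pthread_mutex_unlock(&r->lock);
}
```

Because the flag is only written and read under the lock, the waiter either
sees complete already set or is inside pthread_cond_wait() when the signal
arrives, so the wakeup cannot be lost.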


>> +    } else {
>> +        qemu_mutex_lock_iothread();
>> +        /*
>> +         * make sure proxy wasn't closed while we waited
>> +         * checking without holding the proxy lock is safe
>> +         * since state is only set to CLOSING when iolock is held
> 
> s/iolock/the BQL/
> 

	OK


>> +         */
>> +        if (proxy->state != CLOSING) {
>> +            ret = proxy->request(proxy->reqarg, buf, &reqfds);
>> +            if (ret < 0 && !(msg.flags & VFIO_USER_NO_REPLY)) {
>> +                vfio_user_send_reply(proxy, buf, ret);
>> +            }
>> +        }
>> +        qemu_mutex_unlock_iothread();
>> +    }
>> +
>> +    return;
>> + fatal:
>> +    vfio_user_shutdown(proxy);
>> +    proxy->state = RECV_ERROR;
>> +
>> + err:
>> +    qemu_mutex_unlock(&proxy->lock);
>> +    for (i = 0; i < numfds; i++) {
>> +        close(fdp[i]);
>> +    }
>> +    if (reply != NULL) {
>> +        /* force an error to keep sending thread from hanging */
>> +        reply->msg->flags |= VFIO_USER_ERROR;
>> +        reply->msg->error_reply = EINVAL;
>> +        reply->complete = 1;
>> +        qemu_cond_signal(&reply->cv);
> 
> This has the race condition too.
> 

	Yes

>> +    }
>> +    error_report_err(local_err);
>> +}
>> +
>> +static void vfio_user_send_locked(VFIOProxy *proxy, vfio_user_hdr_t *msg,
>> +                                  VFIOUserFDs *fds)
>> +{
>> +    struct iovec iov = {
>> +        .iov_base = msg,
>> +        .iov_len = msg->size,
>> +    };
>> +    size_t numfds = 0;
>> +    int msgleft, ret, *fdp = NULL;
>> +    char *buf;
>> +    Error *local_err = NULL;
>> +
>> +    if (proxy->state != CONNECTED) {
>> +        msg->flags |= VFIO_USER_ERROR;
>> +        msg->error_reply = ECONNRESET;
>> +        return;
>> +    }
>> +
>> +    if (fds != NULL && fds->send_fds != 0) {
>> +        numfds = fds->send_fds;
>> +        fdp = fds->fds;
>> +    }
>> +    ret = qio_channel_writev_full(proxy->ioc, &iov, 1, fdp, numfds, &local_err);
>> +    if (ret < 0) {
>> +        goto err;
>> +    }
>> +    if (ret == msg->size) {
>> +        return;
>> +    }
>> +
>> +    buf = iov.iov_base + ret;
>> +    msgleft = iov.iov_len - ret;
>> +    do {
>> +        ret = qio_channel_write(proxy->ioc, buf, msgleft, &local_err);
>> +        if (ret < 0) {
>> +            goto err;
>> +        }
>> +        buf += ret, msgleft -= ret;
> 
> Please use semicolon. Comma operators are rare, requiring readers to
> check their exact semantics. There is no need to use comma here.
> 

	OK


>> +    } while (msgleft != 0);
>> +    return;
>> +
>> + err:
>> +    error_report_err(local_err);
> 
> State remains unchanged and msg->error_reply isn't set?
> 

	They should be set.

						JJ



>> +}
>> +
>> +static void vfio_user_send(VFIOProxy *proxy, vfio_user_hdr_t *msg,
>> +                           VFIOUserFDs *fds)
>> +{
>> +    bool iolock = qemu_mutex_iothread_locked();
>> +
>> +    if (iolock) {
>> +        qemu_mutex_unlock_iothread();
>> +    }
>> +    qemu_mutex_lock(&proxy->lock);
>> +    vfio_user_send_locked(proxy, msg, fds);
>> +    qemu_mutex_unlock(&proxy->lock);
>> +    if (iolock) {
>> +        qemu_mutex_lock_iothread();
>> +    }
>> +}
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 12d69f3a45..aa4df6c418 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1883,8 +1883,12 @@ L: qemu-s390x@nongnu.org
>> vfio-user
>> M: John G Johnson <john.g.johnson@oracle.com>
>> M: Thanos Makatos <thanos.makatos@nutanix.com>
>> +M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> +M: Jagannathan Raman <jag.raman@oracle.com>
>> S: Supported
>> F: docs/devel/vfio-user.rst
>> +F: hw/vfio/user.c
>> +F: hw/vfio/user.h
>> 
>> vhost
>> M: Michael S. Tsirkin <mst@redhat.com>
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af297a0..739b30be73 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -8,6 +8,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>>   'display.c',
>>   'pci-quirks.c',
>>   'pci.c',
>> +  'user.c',
>> ))
>> vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
>> vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))
>> -- 
>> 2.25.1
>> 


* Re: [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info
  2021-07-28 10:16   ` Stefan Hajnoczi
@ 2021-07-29  0:55     ` John Johnson
  2021-07-29  8:22       ` Stefan Hajnoczi
  0 siblings, 1 reply; 55+ messages in thread
From: John Johnson @ 2021-07-29  0:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson



> On Jul 28, 2021, at 3:16 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sun, Jul 18, 2021 at 11:27:43PM -0700, Elena Ufimtseva wrote:
>> From: John G Johnson <john.g.johnson@oracle.com>
>> 
>> New class for vfio-user with its class and instance
>> constructors and destructors.
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/vfio/pci.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 49 insertions(+)
>> 
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index bea95efc33..554b562769 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -42,6 +42,7 @@
>> #include "qapi/error.h"
>> #include "migration/blocker.h"
>> #include "migration/qemu-file.h"
>> +#include "hw/vfio/user.h"
>> 
>> #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>> 
>> @@ -3326,3 +3327,51 @@ static void register_vfio_pci_dev_type(void)
>> }
>> 
>> type_init(register_vfio_pci_dev_type)
>> +
>> +static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +    VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
>> +
>> +    if (!udev->sock_name) {
>> +        error_setg(errp, "No socket specified");
>> +        error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
>> +        return;
>> +    }
>> +}
>> +
>> +static void vfio_user_instance_finalize(Object *obj)
>> +{
>> +}
>> +
>> +static Property vfio_user_pci_dev_properties[] = {
>> +    DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
> 
> Please use SocketAddress so that alternative socket connection details
> can be supported without inventing custom syntax for vfio-user-pci. For
> example, file descriptor passing should be possible.
> 
> I think this requires a bit of command-line parsing work, so don't worry
> about it for now, but please add a TODO comment. When the -device
> vfio-user-pci syntax is finalized (i.e. when the code is merged and the
> device name doesn't start with the experimental x- prefix), then it
> needs to be solved.
> 

	What do you want the options to look like at the endgame?  I’d
rather work backward from that than have several different flavors of
options as new socket options are added.  I did look at -chardev socket,
and it was confusing enough that I went for the simple string.



>> +    DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure, false),
> 
> I'm not sure what "secure-dma" means and the "secure" variable name is
> even more inscrutable. Does this mean don't share memory so that each
> DMA access is checked individually?
> 

	Yes.  Do you have another name you’d prefer? “no-shared-mem”?

						JJ



>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
>> +
>> +    device_class_set_props(dc, vfio_user_pci_dev_properties);
>> +    dc->desc = "VFIO over socket PCI device assignment";
>> +    pdc->realize = vfio_user_pci_realize;
>> +}
>> +
>> +static const TypeInfo vfio_user_pci_dev_info = {
>> +    .name = TYPE_VFIO_USER_PCI,
>> +    .parent = TYPE_VFIO_PCI_BASE,
>> +    .instance_size = sizeof(VFIOUserPCIDevice),
>> +    .class_init = vfio_user_pci_dev_class_init,
>> +    .instance_init = vfio_instance_init,
>> +    .instance_finalize = vfio_user_instance_finalize,
>> +};
>> +
>> +static void register_vfio_user_dev_type(void)
>> +{
>> +    type_register_static(&vfio_user_pci_dev_info);
>> +}
>> +
>> +type_init(register_vfio_user_dev_type)
>> -- 
>> 2.25.1
>> 


* Re: [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions
  2021-07-28 18:08     ` John Johnson
@ 2021-07-29  8:06       ` Stefan Hajnoczi
  0 siblings, 0 replies; 55+ messages in thread
From: Stefan Hajnoczi @ 2021-07-29  8:06 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson

On Wed, Jul 28, 2021 at 06:08:26PM +0000, John Johnson wrote:
> 
> 
> > On Jul 27, 2021, at 9:34 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Sun, Jul 18, 2021 at 11:27:42PM -0700, Elena Ufimtseva wrote:
> >> From: John G Johnson <john.g.johnson@oracle.com>
> >> 
> >> Add user.c and user.h files for vfio-user with the basic
> >> send and receive functions.
> >> 
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> hw/vfio/user.h                | 120 ++++++++++++++
> >> include/hw/vfio/vfio-common.h |   2 +
> >> hw/vfio/user.c                | 286 ++++++++++++++++++++++++++++++++++
> >> MAINTAINERS                   |   4 +
> >> hw/vfio/meson.build           |   1 +
> >> 5 files changed, 413 insertions(+)
> >> create mode 100644 hw/vfio/user.h
> >> create mode 100644 hw/vfio/user.c
> > 
> > The multi-threading, coroutine, and blocking I/O requirements of
> > vfio_user_recv() and vfio_user_send_reply() are unclear to me. Please
> > document them so it's clear what environment they can be called from. I
> > guess they are not called from coroutines and proxy->ioc is a blocking
> > IOChannel?
> > 
> 
> 	Yes to both.  A block comment above vfio_user_recv() would also be
> useful.  The call to set up vfio_user_recv() as the socket handler isn’t
> in this patch; do you want the series reorganized?

That would help with review, thanks!

Stefan


* Re: [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info
  2021-07-29  0:55     ` John Johnson
@ 2021-07-29  8:22       ` Stefan Hajnoczi
  0 siblings, 0 replies; 55+ messages in thread
From: Stefan Hajnoczi @ 2021-07-29  8:22 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson

On Thu, Jul 29, 2021 at 12:55:08AM +0000, John Johnson wrote:
> 
> 
> > On Jul 28, 2021, at 3:16 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Sun, Jul 18, 2021 at 11:27:43PM -0700, Elena Ufimtseva wrote:
> >> From: John G Johnson <john.g.johnson@oracle.com>
> >> 
> >> New class for vfio-user with its class and instance
> >> constructors and destructors.
> >> 
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> hw/vfio/pci.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 49 insertions(+)
> >> 
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index bea95efc33..554b562769 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -42,6 +42,7 @@
> >> #include "qapi/error.h"
> >> #include "migration/blocker.h"
> >> #include "migration/qemu-file.h"
> >> +#include "hw/vfio/user.h"
> >> 
> >> #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
> >> 
> >> @@ -3326,3 +3327,51 @@ static void register_vfio_pci_dev_type(void)
> >> }
> >> 
> >> type_init(register_vfio_pci_dev_type)
> >> +
> >> +static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
> >> +{
> >> +    ERRP_GUARD();
> >> +    VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
> >> +
> >> +    if (!udev->sock_name) {
> >> +        error_setg(errp, "No socket specified");
> >> +        error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
> >> +        return;
> >> +    }
> >> +}
> >> +
> >> +static void vfio_user_instance_finalize(Object *obj)
> >> +{
> >> +}
> >> +
> >> +static Property vfio_user_pci_dev_properties[] = {
> >> +    DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
> > 
> > Please use SocketAddress so that alternative socket connection details
> > can be supported without inventing custom syntax for vfio-user-pci. For
> > example, file descriptor passing should be possible.
> > 
> > I think this requires a bit of command-line parsing work, so don't worry
> > about it for now, but please add a TODO comment. When the -device
> > vfio-user-pci syntax is finalized (i.e. when the code is merged and the
> > device name doesn't start with the experimental x- prefix), then it
> > needs to be solved.
> > 
> 
> 	What do you want the options to look like at the endgame?  I’d
> rather work backward from that than have several different flavors of
> options as new socket options are added.  I did look at -chardev socket,
> and it was confusing enough that I went for the simple string.

The standard socket syntax is present in qemu-storage-daemon's --export
and --nbd-server options:

  addr.type=inet,addr.host=<host>,addr.port=<port>
  addr.type=unix,addr.path=<socket-path>
  addr.type=fd,addr.str=<fd>

--export and --nbd-server use QAPI to generate parsers for these options
(they use 'SocketAddress' from qapi/sockets.json). I'm not sure whether
it's easier to reuse the QAPI parser or to simply add qdev properties
mimicking the same syntax. Either way, there should probably be a common
qdev property API for SocketAddress values.
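
To make the three forms above concrete, here is a minimal standalone
sketch of parsing them into a tagged structure. It is not QEMU's QAPI
parser and does not use the real SocketAddress type: struct sock_addr,
get_opt() and parse_sock_addr() are hypothetical names chosen for
illustration, and the sketch ignores details such as escaped commas in
values.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the three address kinds shown above. */
enum sock_kind { SOCK_KIND_INET, SOCK_KIND_UNIX, SOCK_KIND_FD };

struct sock_addr {
    enum sock_kind kind;
    char host[64];    /* addr.host (inet) */
    char port[16];    /* addr.port (inet) */
    char path[108];   /* addr.path (unix) */
    char fd_str[32];  /* addr.str  (fd)   */
};

/* Copy the value of "key=" from a comma-separated option string.
 * Returns 0 if the key is present and fits in out, -1 otherwise. */
static int get_opt(const char *opts, const char *key, char *out, size_t len)
{
    size_t klen = strlen(key);
    const char *p = opts;

    while (p && *p) {
        const char *end = strchr(p, ',');
        size_t n = end ? (size_t)(end - p) : strlen(p);

        if (n > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = n - klen - 1;

            if (vlen >= len) {
                return -1;          /* value too long for the buffer */
            }
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 0;
        }
        p = end ? end + 1 : NULL;   /* advance to the next key=value */
    }
    return -1;
}

/* Parse one of the three forms; returns 0 on success, -1 on error. */
int parse_sock_addr(const char *opts, struct sock_addr *a)
{
    char type[8];

    assert(opts && a);
    memset(a, 0, sizeof(*a));
    if (get_opt(opts, "addr.type", type, sizeof(type)) < 0) {
        return -1;
    }
    if (strcmp(type, "inet") == 0) {
        a->kind = SOCK_KIND_INET;
        if (get_opt(opts, "addr.host", a->host, sizeof(a->host)) < 0) {
            return -1;
        }
        return get_opt(opts, "addr.port", a->port, sizeof(a->port));
    }
    if (strcmp(type, "unix") == 0) {
        a->kind = SOCK_KIND_UNIX;
        return get_opt(opts, "addr.path", a->path, sizeof(a->path));
    }
    if (strcmp(type, "fd") == 0) {
        a->kind = SOCK_KIND_FD;
        return get_opt(opts, "addr.str", a->fd_str, sizeof(a->fd_str));
    }
    return -1;                      /* unknown addr.type */
}
```

With something along these lines, a plain socket=<name> string would
be rejected, while addr.type=unix,addr.path=<socket-path> selects the
same UNIX socket and leaves room for the inet and fd forms.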

> >> +    DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure, false),
> > 
> > I'm not sure what "secure-dma" means and the "secure" variable name is
> > even more inscrutable. Does this mean don't share memory so that each
> > DMA access is checked individually?
> > 
> 
> 	Yes.  Do you have another name you’d prefer? “no-shared-mem”?

I'm not sure other property names are much clearer, so feel free to
stick with "secure-dma". Renaming the "secure" field to "secure_dma" and
adding a comment that clarifies its purpose would be enough.

Here are some options:
- The vfio-user protocol message for sharing memory is called
  VFIO_USER_DMA_MAP. The option could be dma-map=on|off (default on).
  But this is based on protocol internals and may not be clear to users.
- shared-mem=on|off
- shared-ram=on|off
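
For illustration only, the rename plus clarifying comment could look
roughly like the sketch below. The struct is a hypothetical stand-in
for the relevant part of VFIOUserPCIDevice, and set_shared_mem() is an
assumed helper showing how a shared-mem=on|off spelling would simply be
the inverse of secure-dma; neither is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of VFIOUserPCIDevice. */
struct user_pci_dev_sketch {
    /*
     * secure_dma: when true, guest memory is not shared with the
     * remote process, so each DMA access is checked individually.
     * Defaults to false (memory is shared).
     */
    bool secure_dma;
};

/* Map a hypothetical "shared-mem=on|off" value onto secure_dma.
 * Returns 0 on success, -1 if the value is neither "on" nor "off". */
int set_shared_mem(struct user_pci_dev_sketch *d, const char *val)
{
    assert(d && val);
    if (strcmp(val, "on") == 0) {
        d->secure_dma = false;   /* sharing on  => secure DMA off */
        return 0;
    }
    if (strcmp(val, "off") == 0) {
        d->secure_dma = true;    /* sharing off => secure DMA on */
        return 0;
    }
    return -1;
}
```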

Thanks,
Stefan


* Re: [PATCH RFC server 05/11] vfio-user: run vfio-user context
  2021-07-20 14:17     ` Thanos Makatos
@ 2021-08-13 14:51       ` Jag Raman
  2021-08-16 12:52         ` John Levon
  0 siblings, 1 reply; 55+ messages in thread
From: Jag Raman @ 2021-08-13 14:51 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: Elena Ufimtseva, John Johnson, Swapnil Ingle, John Levon,
	qemu-devel, alex.williamson, stefanha



> On Jul 20, 2021, at 10:17 AM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> 
>> -----Original Message-----
>> From: Jagannathan Raman <jag.raman@oracle.com>
>> Sent: 19 July 2021 21:00
>> To: qemu-devel@nongnu.org
>> Cc: stefanha@redhat.com; alex.williamson@redhat.com;
>> elena.ufimtseva@oracle.com; John Levon <john.levon@nutanix.com>;
>> john.g.johnson@oracle.com; Thanos Makatos
>> <thanos.makatos@nutanix.com>; Swapnil Ingle
>> <swapnil.ingle@nutanix.com>; jag.raman@oracle.com
>> Subject: [PATCH RFC server 05/11] vfio-user: run vfio-user context
>> 
>> Setup a separate thread to run the vfio-user context. The thread acts as
>> the main loop for the device.
> 
> In your "vfio-user: instantiate vfio-user context" patch you create the vfu context in blocking-mode, so the only way to run device emulation is in a separate thread.
> Were you going to create a separate thread anyway? You can run device emulation in polling mode and so avoid creating a separate thread, thus saving resources. Do you plan to do that in the future?

Thanks for the information about the Blocking and Non-Blocking mode.

I’d like to explain why we are using a separate thread presently and
check with you if it’s possible to poll on multiple vfu contexts at the
same time (similar to select/poll for fds).

As I understand how devices are executed in QEMU, QEMU initializes the
device instance, where the device registers callbacks for BAR and config
space accesses. The device is then driven by these callbacks: whenever
the vcpu thread tries to access the BAR addresses or issues a config
space access on the PCI bus, the vcpu exits to QEMU, which handles these
accesses. As such, the device is driven by the vcpu thread. Since there
are no vcpu threads in the remote process, we created a separate thread
as a replacement. As you can see, this thread blocks on vfu_run_ctx(),
which I believe polls on the socket for messages from the client.

If there is a way to run multiple vfu contexts at the same time, that
would help conserve threads on the host CPU. For example, a way to add
vfu contexts to a list of contexts that expect messages from the client
could be a good idea. Alternatively, this QEMU server could implement a
similar mechanism itself and group all non-blocking vfu contexts onto a
single thread, instead of having a separate thread for each context.

--
Jag

> 
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/remote/vfio-user-obj.c | 44
>> ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 44 insertions(+)
>> 
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index e362709..6a2d0f5 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -35,6 +35,7 @@
>> #include "trace.h"
>> #include "sysemu/runstate.h"
>> #include "qemu/notify.h"
>> +#include "qemu/thread.h"
>> #include "qapi/error.h"
>> #include "sysemu/sysemu.h"
>> #include "hw/qdev-core.h"
>> @@ -66,6 +67,8 @@ struct VfuObject {
>>     vfu_ctx_t *vfu_ctx;
>> 
>>     PCIDevice *pci_dev;
>> +
>> +    QemuThread vfu_ctx_thread;
>> };
>> 
>> static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
>> @@ -90,6 +93,44 @@ static void vfu_object_set_devid(Object *obj, const
>> char *str, Error **errp)
>>     trace_vfu_prop("devid", str);
>> }
>> 
>> +static void *vfu_object_ctx_run(void *opaque)
>> +{
>> +    VfuObject *o = opaque;
>> +    int ret;
>> +
>> +    ret = vfu_realize_ctx(o->vfu_ctx);
>> +    if (ret < 0) {
>> +        error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
>> +                   o->devid, strerror(errno));
>> +        return NULL;
>> +    }
>> +
>> +    ret = vfu_attach_ctx(o->vfu_ctx);
>> +    if (ret < 0) {
>> +        error_setg(&error_abort,
>> +                   "vfu: Failed to attach device %s to context - %s",
>> +                   o->devid, strerror(errno));
>> +        return NULL;
>> +    }
>> +
>> +    do {
>> +        ret = vfu_run_ctx(o->vfu_ctx);
>> +        if (ret < 0) {
>> +            if (errno == EINTR) {
>> +                ret = 0;
>> +            } else if (errno == ENOTCONN) {
>> +                object_unparent(OBJECT(o));
>> +                break;
>> +            } else {
>> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
>> +                           o->devid, strerror(errno));
>> +            }
>> +        }
>> +    } while (ret == 0);
>> +
>> +    return NULL;
>> +}
>> +
>> static void vfu_object_machine_done(Notifier *notifier, void *data)
>> {
>>     VfuObject *o = container_of(notifier, VfuObject, machine_done);
>> @@ -125,6 +166,9 @@ static void vfu_object_machine_done(Notifier
>> *notifier, void *data)
>>                    pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
>>                    pci_get_word(o->pci_dev->config +
>> PCI_SUBSYSTEM_VENDOR_ID),
>>                    pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));
>> +
>> +    qemu_thread_create(&o->vfu_ctx_thread, "VFU ctx runner",
>> vfu_object_ctx_run,
>> +                       o, QEMU_THREAD_JOINABLE);
>> }
>> 
>> static void vfu_object_init(Object *obj)
>> --
>> 1.8.3.1
> 



* Re: [PATCH RFC server 05/11] vfio-user: run vfio-user context
  2021-08-13 14:51       ` Jag Raman
@ 2021-08-16 12:52         ` John Levon
  2021-08-16 14:10           ` Jag Raman
  0 siblings, 1 reply; 55+ messages in thread
From: John Levon @ 2021-08-16 12:52 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos

On Fri, Aug 13, 2021 at 02:51:53PM +0000, Jag Raman wrote:

> Thanks for the information about the Blocking and Non-Blocking mode.
> 
> I’d like to explain why we are using a separate thread presently and
> check with you if it’s possible to poll on multiple vfu contexts at the
> same time (similar to select/poll for fds).
> 
> Concerning my understanding on how devices are executed in QEMU,
> QEMU initializes the device instance - where the device registers
> callbacks for BAR and config space accesses. The device is then
> subsequently driven by these callbacks - whenever the vcpu thread tries
> to access the BAR addresses or places a config space access to the PCI
> bus, the vcpu exits to QEMU which handles these accesses. As such, the
> device is driven by the vcpu thread. Since there are no vcpu threads in the
> remote process, we created a separate thread as a replacement. As you
> can see already, this thread blocks on vfu_run_ctx() which I believe polls
> on the socket for messages from client.
> 
> If there is a way to run multiple vfu contexts at the same time, that would
> help with conserving threads on the host CPU. For example, if there’s a
> way to add vfu contexts to a list of contexts that expect messages from
> client, that could be a good idea. Alternatively, this QEMU server could
> also implement a similar mechanism to group all non-blocking vfu
> contexts to just a single thread, instead of having separate threads for
> each context.

You can use vfu_get_poll_fd() to retrieve the underlying socket fd (simplest
would be to do this after vfu_attach_ctx(), but that might depend), then poll on
the fd set, doing vfu_run_ctx() when the fd is ready. An async hangup on the
socket would show up as ENOTCONN, in which case you'd remove the fd from the
set.
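
As a rough sketch of that model (not libvfio-user code): the loop
below polls a set of fds, runs the matching context when its fd is
ready, and drops the fd from the set on ENOTCONN. The names ctx_slot,
poll_contexts_once() and demo_run() are hypothetical. In real code each
fd would come from vfu_get_poll_fd() and the run callback would call
vfu_run_ctx() on the corresponding context; how other errors should be
handled is left open here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-context slot. In real code ctx would be the
 * vfu context and run() would wrap vfu_run_ctx(). */
struct ctx_slot {
    void *ctx;               /* opaque per-device context */
    int (*run)(void *ctx);   /* returns < 0 and sets errno on failure */
};

/* Demo stand-in for a run callback: consumes one byte from the fd and
 * reports ENOTCONN on EOF, as a hung-up client would. */
static int demo_run(void *ctx)
{
    int fd = *(int *)ctx;
    char b;
    ssize_t r = read(fd, &b, 1);

    if (r == 0) {
        errno = ENOTCONN;
        return -1;
    }
    return r < 0 ? -1 : 0;
}

/*
 * Poll once over all contexts; fds[i] belongs to slots[i].
 * A context whose run() fails with ENOTCONN is removed from the set by
 * setting its fd to -1, which poll() ignores. Other run() errors are
 * left for the next round in this sketch.
 * Returns the number of contexts still in the set, or -1 if poll()
 * itself fails with anything other than EINTR.
 */
int poll_contexts_once(struct pollfd *fds, struct ctx_slot *slots,
                       size_t n, int timeout_ms)
{
    int active = 0;
    int rc;
    size_t i;

    assert(fds && slots);
    rc = poll(fds, (nfds_t)n, timeout_ms);
    if (rc < 0) {
        if (errno != EINTR) {
            return -1;
        }
        rc = 0;              /* interrupted: treat as nothing ready */
    }

    for (i = 0; i < n; i++) {
        if (fds[i].fd < 0) {
            continue;        /* already removed from the set */
        }
        if (rc > 0 && (fds[i].revents & (POLLIN | POLLHUP | POLLERR))) {
            if (slots[i].run(slots[i].ctx) < 0 && errno == ENOTCONN) {
                fds[i].fd = -1;   /* client went away */
                continue;
            }
        }
        active++;
    }
    return active;
}
```

A single thread could call poll_contexts_once() in a loop over all
contexts instead of dedicating one blocking thread to each.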

Note that we're not completely async yet (e.g. the actual socket read/writes are
synchronous). In practice that's not typically an issue but it could be if you
wanted to support multiple VMs from a single server, etc.


regards
john


* Re: [PATCH RFC server 05/11] vfio-user: run vfio-user context
  2021-08-16 12:52         ` John Levon
@ 2021-08-16 14:10           ` Jag Raman
  0 siblings, 0 replies; 55+ messages in thread
From: Jag Raman @ 2021-08-16 14:10 UTC (permalink / raw)
  To: John Levon
  Cc: Elena Ufimtseva, John Johnson, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos



> On Aug 16, 2021, at 8:52 AM, John Levon <john.levon@nutanix.com> wrote:
> 
> On Fri, Aug 13, 2021 at 02:51:53PM +0000, Jag Raman wrote:
> 
>> Thanks for the information about the Blocking and Non-Blocking mode.
>> 
>> I’d like to explain why we are using a separate thread presently and
>> check with you if it’s possible to poll on multiple vfu contexts at the
>> same time (similar to select/poll for fds).
>> 
>> Concerning my understanding on how devices are executed in QEMU,
>> QEMU initializes the device instance - where the device registers
>> callbacks for BAR and config space accesses. The device is then
>> subsequently driven by these callbacks - whenever the vcpu thread tries
>> to access the BAR addresses or places a config space access to the PCI
>> bus, the vcpu exits to QEMU which handles these accesses. As such, the
>> device is driven by the vcpu thread. Since there are no vcpu threads in the
>> remote process, we created a separate thread as a replacement. As you
>> can see already, this thread blocks on vfu_run_ctx() which I believe polls
>> on the socket for messages from client.
>> 
>> If there is a way to run multiple vfu contexts at the same time, that would
>> help with conserving threads on the host CPU. For example, if there’s a
>> way to add vfu contexts to a list of contexts that expect messages from
>> client, that could be a good idea. Alternatively, this QEMU server could
>> also implement a similar mechanism to group all non-blocking vfu
>> contexts to just a single thread, instead of having separate threads for
>> each context.
> 
> You can use vfu_get_poll_fd() to retrieve the underlying socket fd (simplest
> would be to do this after vfu_attach_ctx(), but that might depend), then poll on
> the fd set, doing vfu_run_ctx() when the fd is ready. An async hangup on the
> socket would show up as ENOTCONN, in which case you'd remove the fd from the
> set.

OK sounds good, will check this model out. Thank you!

--
Jag

> 
> Note that we're not completely async yet (e.g. the actual socket read/writes are
> synchronous). In practice that's not typically an issue but it could be if you
> wanted to support multiple VMs from a single server, etc.
> 
> 
> regards
> john



Thread overview: 55+ messages
2021-07-19  6:27 [PATCH RFC 00/19] vfio-user implementation Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 01/19] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 02/19] vfio-user: add VFIO base abstract class Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 03/19] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
2021-07-27 16:34   ` Stefan Hajnoczi
2021-07-28 18:08     ` John Johnson
2021-07-29  8:06       ` Stefan Hajnoczi
2021-07-19  6:27 ` [PATCH RFC 04/19] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
2021-07-28 10:16   ` Stefan Hajnoczi
2021-07-29  0:55     ` John Johnson
2021-07-29  8:22       ` Stefan Hajnoczi
2021-07-19  6:27 ` [PATCH RFC 05/19] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 06/19] vfio-user: negotiate protocol with " Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 07/19] vfio-user: define vfio-user pci ops Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 08/19] vfio-user: VFIO container setup & teardown Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 09/19] vfio-user: get device info and get irq info Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 10/19] vfio-user: device region read/write Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 11/19] vfio-user: get region and DMA map/unmap operations Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 12/19] vfio-user: probe remote device's BARs Elena Ufimtseva
2021-07-19 22:59   ` Alex Williamson
2021-07-20  1:39     ` John Johnson
2021-07-20  3:01       ` Alex Williamson
2021-07-19  6:27 ` [PATCH RFC 13/19] vfio-user: respond to remote DMA read/write requests Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 14/19] vfio_user: setup MSI/X interrupts and PCI config operations Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 15/19] vfio-user: vfio user device realize Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 16/19] vfio-user: pci reset Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 17/19] vfio-user: probe remote device ROM BAR Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 18/19] vfio-user: migration support Elena Ufimtseva
2021-07-19  6:27 ` [PATCH RFC 19/19] vfio-user: add migration cli options and version negotiation Elena Ufimtseva
2021-07-19 20:00 ` [PATCH RFC server 00/11] vfio-user server in QEMU Jagannathan Raman
2021-07-19 20:00   ` [PATCH RFC server 01/11] vfio-user: build library Jagannathan Raman
2021-07-19 20:24     ` John Levon
2021-07-20 12:06       ` Jag Raman
2021-07-20 12:20         ` Marc-André Lureau
2021-07-20 13:09           ` John Levon
2021-07-19 20:00   ` [PATCH RFC server 02/11] vfio-user: define vfio-user object Jagannathan Raman
2021-07-19 20:00   ` [PATCH RFC server 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
2021-07-19 20:00   ` [PATCH RFC server 04/11] vfio-user: find and init PCI device Jagannathan Raman
2021-07-26 15:05     ` John Levon
2021-07-28 17:08       ` Jag Raman
2021-07-19 20:00   ` [PATCH RFC server 05/11] vfio-user: run vfio-user context Jagannathan Raman
2021-07-20 14:17     ` Thanos Makatos
2021-08-13 14:51       ` Jag Raman
2021-08-16 12:52         ` John Levon
2021-08-16 14:10           ` Jag Raman
2021-07-19 20:00   ` [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
2021-07-26 15:10     ` John Levon
2021-07-19 20:00   ` [PATCH RFC server 07/11] vfio-user: handle DMA mappings Jagannathan Raman
2021-07-20 14:38     ` Thanos Makatos
2021-07-19 20:00   ` [PATCH RFC server 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
2021-07-19 20:00   ` [PATCH RFC server 09/11] vfio-user: handle device interrupts Jagannathan Raman
2021-07-19 20:00   ` [PATCH RFC server 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
2021-07-20 14:05     ` Thanos Makatos
2021-07-19 20:00   ` [PATCH RFC server 11/11] vfio-user: acceptance test Jagannathan Raman
2021-07-20 16:12     ` Thanos Makatos
