All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC v2 00/16] vfio-user implementation
@ 2021-08-16 16:42 Elena Ufimtseva
  2021-08-16 16:42 ` [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
                   ` (16 more replies)
  0 siblings, 17 replies; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

Hi

This is v2 of the RFC patches for vfio-user multi-process QEMU project[1].

Thank you for the review of v1 of the RFC patches.

vfio-user is a protocol that allows a device to be emulated in a separate
process outside of QEMU. It encapsulates the messages sent from QEMU to the
kernel VFIO driver, and sends them to a remote process over a UNIX socket.

The vfio-user framework consists of 3 parts:
 1) The protocol specification.
 2) A server - the VFIO generic device in QEMU that exchanges the protocol messages with the client.
 3) A client - remote process that emulates a device.

This patchset implements parts 1 and 2.
The protocol's specification can be found here [2]:
We also include this as the first patch of the series.

The libvfio-user project (https://github.com/nutanix/libvfio-user)
can be used by a remote process to handle the protocol to implement the
third part.
We also worked on implementing a client and will be sending this patch
series shortly.

Contributors:

John G Johnson <john.g.johnson@oracle.com>
John Levon <john.levon@nutanix.com>
Thanos Makatos <thanos.makatos@nutanix.com>
Elena Ufimtseva <elena.ufimtseva@oracle.com>
Jagannathan Raman <jag.raman@oracle.com>

 Changes in v2:
 - combine some patches with relevant functionality.
 - use SocketAddress with idea to modify later the command line options.
 - define protocol bits in user-protocol.h.
 - use QEMU_LOCK_GUARD where appropriate.
 - fix the locking when event signaling.
 - do not drop BQL on dma map/unmap.
 - added checks for message sizes in communication functions.

John Johnson (15):
  vfio-user: add VFIO base abstract class
  vfio-user: Define type vfio_user_pci_dev_info
  vfio-user: connect vfio proxy to remote server
  vfio-user: define VFIO Proxy and communication functions
  vfio-user: negotiate version with remote server
  vfio-user: get device info
  vfio-user: get region info
  vfio-user: region read/write
  vfio-user: pci_user_realize PCI setup
  vfio-user: get and set IRQs
  vfio-user: proxy container connect/disconnect
  vfio-user: dma map/unmap operations
  vfio-user: dma read/write operations
  vfio-user: pci reset
  vfio-user: migration support

Thanos Makatos (1):
  vfio-user: introduce vfio-user protocol specification

 docs/devel/index.rst          |    1 +
 docs/devel/vfio-user.rst      | 1809 +++++++++++++++++++++++++++++++++
 hw/vfio/pci.h                 |   25 +-
 hw/vfio/user-protocol.h       |  210 ++++
 hw/vfio/user.h                |   95 ++
 include/hw/vfio/vfio-common.h |    9 +
 hw/vfio/common.c              |  296 +++++-
 hw/vfio/migration.c           |   34 +-
 hw/vfio/pci.c                 |  571 +++++++++--
 hw/vfio/user.c                | 1104 ++++++++++++++++++++
 MAINTAINERS                   |   11 +
 hw/vfio/meson.build           |    1 +
 12 files changed, 4062 insertions(+), 104 deletions(-)
 create mode 100644 docs/devel/vfio-user.rst
 create mode 100644 hw/vfio/user-protocol.h
 create mode 100644 hw/vfio/user.h
 create mode 100644 hw/vfio/user.c

-- 
2.25.1



^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-17 23:04   ` Alex Williamson
  2021-08-16 16:42 ` [PATCH RFC v2 02/16] vfio-user: add VFIO base abstract class Elena Ufimtseva
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: Thanos Makatos <thanos.makatos@nutanix.com>

This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

This patch is sourced from:
https://patchwork.kernel.org/project/qemu-devel/patch/20210614104608.212276-1-thanos.makatos@nutanix.com/

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
Signed-off-by: John Levon <john.levon@nutanix.com>
---
 docs/devel/index.rst     |    1 +
 docs/devel/vfio-user.rst | 1809 ++++++++++++++++++++++++++++++++++++++
 MAINTAINERS              |    6 +
 3 files changed, 1816 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index 5522db7241..304ca1c12f 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -44,3 +44,4 @@ modifying QEMU's source code.
    vfio-migration
    qapi-code-gen
    writing-qmp-commands
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 0000000000..0b2acec101
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1809 @@
+.. include:: <isonum.txt>
+********************************
+vfio-user Protocol Specification
+********************************
+
+--------------
+Version_ 0.9.1
+--------------
+
+.. contents:: Table of Contents
+
+Introduction
+============
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The vfio-user specification is partly based on the
+`Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_.
+
+VFIO is a mature and stable API, backed by an extensively used framework. The
+existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
+re-used, though there is nothing in this specification that requires that
+particular implementation. None of the VFIO kernel modules are required for
+supporting the protocol, on either the client or server side. Some source
+definitions in VFIO are re-used for vfio-user.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
+chosen because file descriptors can be trivially sent over it, which in turn
+allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.
+
+Other socket types could be used which allow the server to run in a separate
+guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically
+the underlying transport does not necessarily have to be a socket, however we do
+not examine such alternatives. In this protocol version we focus on using a UNIX
+domain socket and introduce basic support for the other two types of sockets
+without considering performance implications.
+
+While passing of file descriptors is desirable for performance reasons, support
+is not necessary for either the client or the server in order to implement the
+protocol. There is always an in-band, message-passing fall back mechanism.
+
+Overview
+========
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the device-specific kernel driver does not drive the
+device at all.  Typically, the user space process is a VMM and the device is
+passed through to it in order to achieve high performance. VFIO provides an API
+and the required functionality in the kernel. QEMU has adopted VFIO to allow a
+guest to directly access physical devices, instead of emulating them in
+software.
+
+vfio-user reuses the core VFIO concepts defined in its API, but implements them
+as messages to be sent over a socket. It does not change the kernel-based VFIO
+in any way, in fact none of the VFIO kernel modules need to be loaded to use
+vfio-user. It is also possible for the client to concurrently use the current
+kernel-based VFIO for one device, and vfio-user for another device.
+
+VFIO Device Model
+-----------------
+
+A device under VFIO presents a standard interface to the user process. Many of
+the VFIO operations in the existing interface use the ``ioctl()`` system call, and
+references to the existing interface are called the ``ioctl()`` implementation in
+this document.
+
+The following sections describe the set of messages that implement the vfio-user
+interface over a socket. In many cases, the messages are analogous to data
+structures used in the ``ioctl()`` implementation. Messages derived from the
+``ioctl()`` will have a name derived from the ``ioctl()`` command name.  E.g., the
+``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
+``VFIO_USER_DEVICE_GET_INFO`` message.  The purpose of this reuse is to share as
+much code as feasible with the ``ioctl()`` implementation``.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+
+After the client connects to the server, the initial client message is
+``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
+apply to the session. The server replies with a compatible version and set of
+capabilities it supports, or closes the connection if it cannot support the
+advertised version.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
+information about the device. This information includes:
+
+* The device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
+* the number of device regions, and
+* the device presents to the client the number of interrupt types the device
+  supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
+server for information about the device's regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
+* Region index, size, and offset.
+
+When a device region can be mapped by the client, the server provides a file
+descriptor which the client can ``mmap()``. The server is responsible for
+polling for client updates to memory mapped regions.
+
+Region Capabilities
+"""""""""""""""""""
+
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space.
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part. When only a subset of a
+region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
+capability is included in the region info reply. This capability describes
+which portions can be mapped by the client.
+
+.. Note::
+   For example, in a virtual NVMe controller, sparse regions can be used so
+   that accesses to the NVMe registers (found in the beginning of BAR0) are
+   trapped (an infrequent event), while allowing direct access to the doorbells
+   (an extremely frequent event as every I/O submission requires a write to
+   BAR0), found in the next page after the NVMe registers in BAR0.
+
+Device-Specific Regions
+"""""""""""""""""""""""
+
+A device can define regions additional to the standard ones (e.g. PCI indexes
+0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
+in the region info reply of a device-specific region. Such regions are reflected
+in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
+value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.
+
+Region I/O via file descriptors
+-------------------------------
+
+For unmapped regions, region I/O from the client is done via
+``VFIO_USER_REGION_READ/WRITE``.  As an optimization, ioeventfds or ioregionfds
+may be configured for sub-regions of some regions. A client may request
+information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
+configuring the returned file descriptors as ioeventfds or ioregionfds, the
+server can be directly notified of I/O (for example, by KVM) without taking a
+trip through the client.
+
+Interrupts
+^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
+for the device's interrupt types. The interrupt types are specific to the bus
+the device is attached to, and the client is expected to know the capabilities
+of each interrupt type. The server can signal an interrupt by directly injecting
+interrupts into the guest via an event file descriptor. The client configures
+how the server signals an interrupt with ``VFIO_USER_SET_IRQS`` messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+
+When the guest executes load or store operations to an unmapped device region,
+the client forwards these operations to the server with
+``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
+will reply with data from the device on read operations or an acknowledgement on
+write operations. See `Read and Write Operations`_.
+
+Client memory access
+--------------------
+
+The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
+inform the server of the valid DMA ranges that the server can access on behalf
+of a device (typically, VM guest memory). DMA memory may be accessed by the
+server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
+socket. In this case, the "DMA" part of the naming is a misnomer.
+
+Actual direct memory access of client memory from the server is possible if the
+client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
+privileges cannot be revoked by the client, therefore file descriptors should
+only be exported in environments where the client trusts the server not to
+corrupt guest memory.
+
+See `Read and Write Operations`_.
+
+Client/server interactions
+==========================
+
+Socket
+------
+
+A server can serve:
+
+1) one or more clients, and/or
+2) one or more virtual devices, belonging to one or more clients.
+
+The current protocol specification requires a dedicated socket per
+client/server connection. It is a server-side implementation detail whether a
+single server handles multiple virtual devices from the same or multiple
+clients. The location of the socket is implementation-specific. Multiplexing
+clients, devices, and servers over the same socket is not supported in this
+version of the protocol.
+
+Authentication
+--------------
+
+For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types than span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Command Concurrency
+-------------------
+
+A client may pipeline multiple commands without waiting for previous command
+replies.  The server will process commands in the order they are received.  A
+consequence of this is if a client issues a command with the *No_reply* bit,
+then subsequently issues a command without *No_reply*, the older command will
+have been processed before the reply to the younger command is sent by the
+server.  The client must be aware of the device's capability to process
+concurrent commands if pipelining is used.  For example, pipelining allows
+multiple client threads to concurrently access device regions; the client must
+ensure these accesses obey device semantics.
+
+An example is a frame buffer device, where the device may allow concurrent
+access to different areas of video memory, but may have indeterminate behavior
+if concurrent accesses are performed to command or status registers.
+
+Note that unrelated messages sent from the server to the client can appear in
+between a client to server request/reply and vice versa.
+
+Implementers should be prepared for certain commands to exhibit potentially
+unbounded latencies.  For example, ``VFIO_USER_DEVICE_RESET`` may take an
+arbitrarily long time to complete; clients should take care not to block
+unnecessarily.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly. Both the client and the server need to know how to handle such
+events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+   device update) or unintentionally (e.g. because of a crash).
+2) A virtual device has been shut down with no intention to be restarted.
+
+It is impossible for the client to know whether or not a failure is
+intermittent or innocuous and should be retried, therefore the client should
+reset the VFIO device when it detects the socket has been disconnected.
+Error recovery will be driven by the guest's device error handling
+behavior.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the client
+has exited. Currently, this means that the guest is shut down so the device is
+no longer needed therefore the server can automatically exit. However, there
+can be cases where a client disconnection should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which is not yet
+   implemented.
+
+Therefore in order for the protocol to be forward compatible, the server should
+respond to a client disconnection as follows:
+
+ - all client memory regions are unmapped and cleaned up (including closing any
+   passed file descriptors)
+ - all IRQ file descriptors passed from the old client are closed
+ - the device state should otherwise be retained
+
+The expectation is that when a client reconnects, it will re-establish IRQ and
+client memory mappings.
+
+If anything happens to the client (such as qemu really did exit), the control
+stack will know about it and can clean up resources accordingly.
+
+Security Considerations
+-----------------------
+
+Speaking generally, vfio-user clients should not trust servers, and vice versa.
+Standard tools and mechanisms should be used on both sides to validate input and
+prevent against denial of service scenarios, buffer overflow, etc.
+
+Request Retry and Response Timeout
+----------------------------------
+A failed command is a command that has been successfully sent and has been
+responded to with an error code. Failure to send the command in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+.. Note::
+   QEMU's VFIO retries certain operations if they fail. While this makes sense
+   for real HW, we don't know for sure whether it makes sense for virtual
+   devices.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+Message sizes
+-------------
+
+Some requests have an ``argsz`` field. In a request, it defines the maximum
+expected reply payload size, which should be at least the size of the fixed
+reply payload headers defined here. The *request* payload size is defined by the
+usual ``msg_size`` field in the header, not the ``argsz`` field.
+
+In a reply, the server sets ``argsz`` field to the size needed for a full
+payload size. This may be less than the requested maximum size. This may be
+larger than the requested maximum size: in that case, the full payload is not
+included in the reply, but the ``argsz`` field in the reply indicates the needed
+size, allowing a client to allocate a larger buffer for holding the reply before
+trying again.
+
+In addition, during negotiation (see  `Version`_), the client and server may
+each specify a ``max_data_xfer_size`` value; this defines the maximum data that
+may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
+messages; see `Read and Write Operations`_.
+
+Protocol Specification
+======================
+
+To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
+with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
+little-endian format, although this may be relaxed in future revisions in cases
+where the client and server are both big-endian.
+
+Unless otherwise specified, all sizes should be presumed to be in bytes.
+
+.. _Commands:
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message command is sent from the client or the server.
+
+======================================  =========  =================
+Name                                    Command    Request Direction
+======================================  =========  =================
+``VFIO_USER_VERSION``                   1          client -> server
+``VFIO_USER_DMA_MAP``                   2          client -> server
+``VFIO_USER_DMA_UNMAP``                 3          client -> server
+``VFIO_USER_DEVICE_GET_INFO``           4          client -> server
+``VFIO_USER_DEVICE_GET_REGION_INFO``    5          client -> server
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``  6          client -> server
+``VFIO_USER_DEVICE_GET_IRQ_INFO``       7          client -> server
+``VFIO_USER_DEVICE_SET_IRQS``           8          client -> server
+``VFIO_USER_REGION_READ``               9          client -> server
+``VFIO_USER_REGION_WRITE``              10         client -> server
+``VFIO_USER_DMA_READ``                  11         server -> client
+``VFIO_USER_DMA_WRITE``                 12         server -> client
+``VFIO_USER_DEVICE_RESET``              13         client -> server
+``VFIO_USER_DIRTY_PAGES``               14         client -> server
+======================================  =========  =================
+
+Header
+------
+
+All messages, both command messages and reply messages, are preceded by a
+16-byte header that contains basic information about the message. The header is
+followed by message-specific data described in the sections below.
+
++----------------+--------+-------------+
+| Name           | Offset | Size        |
++================+========+=============+
+| Message ID     | 0      | 2           |
++----------------+--------+-------------+
+| Command        | 2      | 2           |
++----------------+--------+-------------+
+| Message size   | 4      | 4           |
++----------------+--------+-------------+
+| Flags          | 8      | 4           |
++----------------+--------+-------------+
+|                | +-----+------------+ |
+|                | | Bit | Definition | |
+|                | +=====+============+ |
+|                | | 0-3 | Type       | |
+|                | +-----+------------+ |
+|                | | 4   | No_reply   | |
+|                | +-----+------------+ |
+|                | | 5   | Error      | |
+|                | +-----+------------+ |
++----------------+--------+-------------+
+| Error          | 12     | 4           |
++----------------+--------+-------------+
+| <message data> | 16     | variable    |
++----------------+--------+-------------+
+
+* *Message ID* identifies the message, and is echoed in the command's reply
+  message. Message IDs belong entirely to the sender, can be re-used (even
+  concurrently) and the receiver must not make any assumptions about their
+  uniqueness.
+* *Command* specifies the command to be executed, listed in Commands_. It is
+  also set in the reply header.
+* *Message size* contains the size of the entire message, including the header.
+* *Flags* contains attributes of the message:
+
+  * The *Type* bits indicate the message type.
+
+    *  *Command* (value 0x0) indicates a command message.
+    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
+       command with the same message ID.
+  * *No_reply* in a command message indicates that no reply is needed for this
+    command.  This is commonly used when multiple commands are sent, and only
+    the last needs acknowledgement.
+  * *Error* in a reply message indicates the command being acknowledged had
+    an error. In this case, the *Error* field will be valid.
+
+* *Error* in a reply message is an optional UNIX errno value. It may be zero
+  even if the Error bit is set in Flags. It is reserved in a command message.
+
+Each command message in Commands_ must be replied to with a reply message,
+unless the message sets the *No_Reply* bit.  The reply consists of the header
+with the *Reply* bit set, plus any additional data.
+
+If an error occurs, the reply message must only include the reply header.
+
+As the header is standard in both requests and replies, it is not included in
+the command-specific specifications below; each message definition should be
+appended to the standard header, and the offsets are given from the end of the
+standard header.
+
+``VFIO_USER_VERSION``
+---------------------
+
+.. _Version:
+
+This is the initial message sent by the client after the socket connection is
+established; the same format is used for the server's reply.
+
+Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
+message proposing a protocol version and a set of capabilities. The server
+compares these with the versions and capabilities it supports and sends a
+``VFIO_USER_VERSION`` reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the client
+  does not support the proposed major, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+  version proposed.
+* The capability list must be a subset of those proposed. If the server
+  requires a capability the client did not include, it closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format. The minor version may change
+when compatible changes are made, such as adding new messages or capabilities,
+Both the client and server must support all minor versions less than the
+maximum minor version it supports. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
+
+When making a change to this specification, the protocol version number must
+be included in the form "added in version X.Y"
+
+Request
+^^^^^^^
+
+==============  ======  ====
+Name            Offset  Size
+==============  ======  ====
+version major   0       2
+version minor   2       2
+version data    4       variable (including terminating NUL). Optional.
+==============  ======  ====
+
+The version data is an optional UTF-8 encoded JSON byte array with the following
+format:
+
++--------------+--------+-----------------------------------+
+| Name         | Type   | Description                       |
++==============+========+===================================+
+| capabilities | object | Contains common capabilities that |
+|              |        | the sender supports. Optional.    |
++--------------+--------+-----------------------------------+
+
+Capabilities:
+
++--------------------+--------+------------------------------------------------+
+| Name               | Type   | Description                                    |
++====================+========+================================================+
+| max_msg_fds        | number | Maximum number of file descriptors that can be |
+|                    |        | received by the sender in one message.         |
+|                    |        | Optional. If not specified then the receiver   |
+|                    |        | must assume a value of ``1``.                  |
++--------------------+--------+------------------------------------------------+
+| max_data_xfer_size | number | Maximum ``count`` for data transfer messages;  |
+|                    |        | see `Read and Write Operations`_. Optional,    |
+|                    |        | with a default value of 1048576 bytes.         |
++--------------------+--------+------------------------------------------------+
+| migration          | object | Migration capability parameters. If missing    |
+|                    |        | then migration is not supported by the sender. |
++--------------------+--------+------------------------------------------------+
+
+The migration capability contains the following name/value pairs:
+
++--------+--------+-----------------------------------------------+
+| Name   | Type   | Description                                   |
++========+========+===============================================+
+| pgsize | number | Page size of dirty pages bitmap. The smallest |
+|        |        | between the client and the server is used.    |
++--------+--------+-----------------------------------------------+
+
+Reply
+^^^^^
+
+The same message format is used in the server's reply with the semantics
+described above.
+
+``VFIO_USER_DMA_MAP``
+---------------------
+
+This command message is sent by the client to the server to inform it of the
+memory regions the server can access. It must be sent before the server can
+perform any DMA to the client. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added to the client,
+or if the client uses a vIOMMU.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++-------------+--------+-------------+
+| Name        | Offset | Size        |
++=============+========+=============+
+| argsz       | 0      | 4           |
++-------------+--------+-------------+
+| flags       | 4      | 4           |
++-------------+--------+-------------+
+|             | +-----+------------+ |
+|             | | Bit | Definition | |
+|             | +=====+============+ |
+|             | | 0   | readable   | |
+|             | +-----+------------+ |
+|             | | 1   | writeable  | |
+|             | +-----+------------+ |
++-------------+--------+-------------+
+| offset      | 8      | 8           |
++-------------+--------+-------------+
+| address     | 16     | 8           |
++-------------+--------+-------------+
+| size        | 24     | 8           |
++-------------+--------+-------------+
+
+* *argsz* is the size of the above structure. Note there is no reply payload,
+  so this field differs from other message types.
+* *flags* contains the following region attributes:
+
+  * *readable* indicates that the region can be read from.
+
+  * *writeable* indicates that the region can be written to.
+
+* *offset* is the file offset of the region with respect to the associated file
+  descriptor, or zero if the region is not mappable
+* *address* is the base DMA address of the region.
+* *size* is the size of the region.
+
+This structure is 32 bytes in size, so the message size is 16 + 32 bytes.
+
+If the DMA region being added can be directly mapped by the server, a file
+descriptor must be sent as part of the message meta-data. The region can be
+mapped via the mmap() system call. On ``AF_UNIX`` sockets, the file descriptor
+must be passed as ``SCM_RIGHTS`` type ancillary data.  Otherwise, if the DMA
+region cannot be directly mapped by the server, no file descriptor must be sent
+as part of the message meta-data and the DMA region can be accessed by the
+server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
+explained in `Read and Write Operations`_. A command to map over an existing
+region must be failed by the server with ``EEXIST`` set in error field in the
+reply.
+
+Reply
+^^^^^
+
+There is no payload in the reply message.
+
+``VFIO_USER_DMA_UNMAP``
+-----------------------
+
+This command message is sent by the client to the server to inform it that a
+DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
+message, is no longer available for DMA. It typically occurs when memory is
+subtracted from the client or if the client uses a vIOMMU. The DMA region is
+described by the following structure:
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++--------------+--------+------------------------+
+| Name         | Offset | Size                   |
++==============+========+========================+
+| argsz        | 0      | 4                      |
++--------------+--------+------------------------+
+| flags        | 4      | 4                      |
++--------------+--------+------------------------+
+|              | +-----+-----------------------+ |
+|              | | Bit | Definition            | |
+|              | +=====+=======================+ |
+|              | | 0   | get dirty page bitmap | |
+|              | +-----+-----------------------+ |
++--------------+--------+------------------------+
+| address      | 8      | 8                      |
++--------------+--------+------------------------+
+| size         | 16     | 8                      |
++--------------+--------+------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *flags* contains the following DMA region attributes:
+
+  * *get dirty page bitmap* indicates that a dirty page bitmap must be
+    populated before unmapping the DMA region. The client must provide a
+    `VFIO Bitmap`_ structure, explained below, immediately following this
+    entry.
+
+* *address* is the base DMA address of the DMA region.
+* *size* is the size of the DMA region.
+
+The address and size of the DMA region being unmapped must match exactly a
+previous mapping. The size of request message depends on whether or not the
+*get dirty page bitmap* bit is set in Flags:
+
+* If not set, the size of the total request message is: 16 + 24.
+
+* If set, the size of the total request message is: 16 + 24 + 16.
+
+.. _VFIO Bitmap:
+
+VFIO Bitmap Format
+""""""""""""""""""
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| pgsize | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+
+* *pgsize* is the page size for the bitmap, in bytes.
+* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
+
+Reply
+^^^^^
+
+Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
+mapped then the server must release all references to that DMA region before
+replying, which potentially includes in-flight DMA transactions.
+
+The server responds with the original DMA entry in the request. If the
+*get dirty page bitmap* bit is set in flags in the request, then
+the server also includes the `VFIO Bitmap`_ structure sent in the request,
+followed by the corresponding dirty page bitmap, where each bit represents
+one page of size *pgsize* in `VFIO Bitmap`_ .
+
+The total size of the total reply message is:
+16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set).
+
+``VFIO_USER_DEVICE_GET_INFO``
+-----------------------------
+
+This command message is sent by the client to the server to query for basic
+information about the device.
+
+Request
+^^^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 0      | 4                        |
++-------------+--------+--------------------------+
+| flags       | 4      | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8      | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 12     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the maximum size of the reply payload
+* all other fields must be zero.
+
+Reply
+^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 0      | 4                        |
++-------------+--------+--------------------------+
+| flags       | 4      | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8      | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 12     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* contains the following device attributes.
+
+  * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
+    ``VFIO_USER_DEVICE_RESET`` message.
+  * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
+
+* *num_regions* is the number of memory regions that the device exposes.
+* *num_irqs* is the number of distinct interrupt types that the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions.
+
+``VFIO_USER_DEVICE_GET_REGION_INFO``
+------------------------------------
+
+This command message is sent by the client to the server to query for
+information about device regions. The VFIO region info structure is defined in
+``<linux/vfio.h>`` (``struct vfio_region_info``).
+
+Request
+^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* the maximum size of the reply payload
+* *index* is the index of memory region being queried, it is the only field
+  that is required to be set in the command message.
+* all other fields must be zero.
+
+Reply
+^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the size required for the full reply payload (region info structure
+  plus the size of any region capabilities)
+* *flags* are attributes of the region:
+
+  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
+  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
+  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region.
+    When this flag is set, the reply will include a file descriptor in its
+    meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as
+    ``SCM_RIGHTS`` type ancillary data.
+  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
+    reply.
+
+* *index* is the index of memory region being queried, it is the only field
+  that is required to be set in the command message.
+* *cap_offset* describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points is a VFIO cap header defined in
+  ``<linux/vfio.h>``.
+* *size* is the size of the region.
+* *offset* is the offset that should be given to the mmap() system call for
+  regions with the MMAP attribute. It is also used as the base offset when
+  mapping a VFIO sparse mmap area, described below.
+
+VFIO region capabilities
+""""""""""""""""""""""""
+
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list - each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
+vfio_info_cap_header``).
+
+VFIO cap header format
+""""""""""""""""""""""
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* *id* is the capability identity.
+* *version* is a capability-specific version number.
+* *next* specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
+
+VFIO sparse mmap cap header
+"""""""""""""""""""""""""""
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             | <next>                           |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region info cap sparse mmap
+""""""""""""""""""""""""""""""""
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 9    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* *nr_areas* is the number of sparse mmap areas in the region.
+* *offset* and size describe a single area that can be mapped by the client.
+  There will be *nr_areas* pairs of offset and size. The offset will be added to
+  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` to form the
+  offset argument of the subsequent mmap() call.
+
+The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region type cap header
+"""""""""""""""""""""""""""
+
++------------------+---------------------------+
+| Name             | Value                     |
++==================+===========================+
+| id               | VFIO_REGION_INFO_CAP_TYPE |
++------------------+---------------------------+
+| version          | 0x1                       |
++------------------+---------------------------+
+| next             | <next>                    |
++------------------+---------------------------+
+| region info type | VFIO region info type     |
++------------------+---------------------------+
+
+This capability is defined when a region is specific to the device.
+
+VFIO region info type cap
+"""""""""""""""""""""""""
+
+The VFIO region info type is defined in ``<linux/vfio.h>``
+(``struct vfio_region_info_cap_type``).
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| type    | 0      | 4    |
++---------+--------+------+
+| subtype | 4      | 4    |
++---------+--------+------+
+
+The only device-specific region type and subtype supported by vfio-user is
+``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
+--------------------------------------
+
+Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by
+``mmap()`` of a file descriptor provided by the server.
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via
+file descriptors. This is an optional feature intended for performance
+improvements where an underlying sub-system (such as KVM) supports communication
+across such file descriptors to the vfio-user server, without needing to
+round-trip through the client.
+
+The server returns an array of sub-regions for the requested region. Each
+sub-region describes a span (offset and size) of a region, along with the
+requested file descriptor notification mechanism to use.  Each sub-region in the
+response message may choose to use a different method, as defined below.  The
+two mechanisms supported in this specification are ioeventfds and ioregionfds.
+
+The server in addition returns a file descriptor in the ancillary data; clients
+are expected to configure each sub-region's file descriptor with the requested
+notification method. For example, a client could configure KVM with the
+requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.
+
+Request
+^^^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+
+* *argsz* the maximum size of the reply payload
+* *index* is the index of memory region being queried
+* all other fields must be zero
+
+The client must set ``flags`` to zero and specify the region being queried in
+the ``index``.
+
+Reply
+^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+| sub-regions | 16     | ...  |
++-------------+--------+------+
+
+* *argsz* is the size of the region IO FD info structure plus the
+  total size of the sub-region array. Thus, each array entry "i" is at offset
+  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
+  FD types, but this is not to be relied on. As elsewhere, this indicates the
+  full reply payload size needed.
+* *flags* must be zero
+* *index* is the index of memory region being queried
+* *count* is the number of sub-regions in the array
+* *sub-regions* is the array of Sub-Region IO FD info structures
+
+The reply message will additionally include at least one file descriptor in the
+ancillary data. Note that more than one sub-region may share the same file
+descriptor.
+
+Note that it is the client's responsibility to verify the requested values (for
+example, that the requested offset does not exceed the region's bounds).
+
+Each sub-region given in the response has one of two possible structures,
+depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
+``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
+
+Sub-Region IO FD info format (ioeventfd)
+""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| datamatch | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region. This may be zero if the access size is
+  not relevant, which may allow for optimizations
+* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
+  notification; it may be shared.
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
+* *flags* is any of:
+
+  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
+  * ``KVM_IOEVENTFD_FLAG_PIO``
+  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)
+
+* *datamatch* is the datamatch value if needed
+
+See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
+KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
+
+Sub-Region IO FD info format (ioregionfd)
+"""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| user_data | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region. This may be zero if the access size is
+  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
+  must be set in *flags* in this case
+* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
+  messages; it may be shared
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
+* *flags* is any of:
+
+  * ``KVM_IOREGION_PIO``
+  * ``KVM_IOREGION_POSTED_WRITES``
+
+* *user_data* is an opaque value passed back to the server via a message on the
+  file descriptor
+
+For further information on the ioregionfd-specific fields, see:
+https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
+
+(FIXME: update with final API docs.)
+
+``VFIO_USER_DEVICE_GET_IRQ_INFO``
+---------------------------------
+
+This command message is sent by the client to the server to query for
+information about device interrupt types. The VFIO IRQ info structure is
+defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).
+
+Request
+^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the maximum size of the reply payload (16 bytes today)
+* index is the index of IRQ type being queried (e.g. ``VFIO_PCI_MSIX_IRQ_INDEX``)
+* all other fields must be zero
+
+Reply
+^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* defines IRQ attributes:
+
+  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
+    signalling.
+  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
+    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
+  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
+    triggered, and the client must send an ``UNMASK`` action to receive new
+    interrupts.
+  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_SET_IRQS`` operations setup
+    interrupts as a set, and new sub-indexes cannot be enabled without disabling
+    the entire type.
+* index is the index of IRQ type being queried
+* count describes the number of interrupts of the queried type.
+
+``VFIO_USER_DEVICE_SET_IRQS``
+-----------------------------
+
+This command message is sent by the client to the server to set actions for
+device interrupt types. The VFIO IRQ set structure is defined in
+``<linux/vfio.h>`` (``struct vfio_irq_set``).
+
+Request
+^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 0      | 4                            |
++-------+--------+------------------------------+
+| flags | 4      | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 8      | 4                            |
++-------+--------+------------------------------+
+| start | 12     | 4                            |
++-------+--------+------------------------------+
+| count | 16     | 4                            |
++-------+--------+------------------------------+
+| data  | 20     | variable                     |
++-------+--------+------------------------------+
+
+* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
+  field. Note there is no reply payload, so this field differs from other
+  message types.
+* *flags* defines the action performed on the interrupt range. The ``DATA``
+  flags describe the data field sent in the message; the ``ACTION`` flags
+  describe the action to be performed. The flags are mutually exclusive for
+  both sets.
+
+  * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
+    The action is performed unconditionally.
+  * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
+    bytes. The action is performed if the corresponding boolean is true.
+  * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors
+    was sent in the message meta-data. These descriptors will be signalled when
+    the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the
+    descriptors are sent as ``SCM_RIGHTS`` type ancillary data.
+    If no file descriptors are provided, this de-assigns the specified
+    previously configured interrupts.
+  * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with
+    ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an interrupt,
+    or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks
+    the interrupt.
+  * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used
+    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an
+    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
+    guest unmasks the interrupt.
+  * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used
+    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an
+    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
+    server triggers the interrupt.
+
+* *index* is the index of IRQ type being setup.
+* *start* is the start of the sub-index being set.
+* *count* describes the number of sub-indexes being set. As a special case, a
+  count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables
+  all interrupts of the index.
+* *data* is an optional field included when the
+  ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans
+  that specify whether the action is to be performed on the corresponding
+  index. It's used when the action is only performed on a subset of the range
+  specified.
+
+Not all interrupt types support every combination of data and action flags.
+The client must know the capabilities of the device and IRQ index before it
+sends a ``VFIO_USER_DEVICE_SET_IRQ`` message.
+
+In typical operation, a specific IRQ may operate as follows:
+
+1. The client sends a ``VFIO_USER_DEVICE_SET_IRQ`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along
+   with an eventfd. This associates the IRQ with a particular eventfd on the
+   server side.
+
+#. The client may send a ``VFIO_USER_DEVICE_SET_IRQ`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along
+   with another eventfd. This associates the given eventfd with the
+   mask/unmask state on the server side.
+
+#. The server may trigger the IRQ by writing 1 to the eventfd.
+
+#. The server may mask/unmask an IRQ which will write 1 to the corresponding
+   mask/unmask eventfd, if there is one.
+
+5. A client may trigger a device IRQ itself, by sending a
+   ``VFIO_USER_DEVICE_SET_IRQ`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``.
+
+6. A client may mask or unmask the IRQ, by sending a
+   ``VFIO_USER_DEVICE_SET_IRQ`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``.
+
+Reply
+^^^^^
+
+There is no payload in the reply.
+
+.. _Read and Write Operations:
+
+Note that all of these operations must be supported by the client and/or server,
+even if the corresponding memory or device region has been shared as mappable.
+
+The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
+peer, for both reads and writes.
+
+``VFIO_USER_REGION_READ``
+-------------------------
+
+If a device region is not mappable, it's not directly accessible by the client
+via ``mmap()`` of the underlying file descriptor. In this case, a client can
+read from a device region with this message.
+
+Request
+^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+
+* *offset* into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred.
+
+Reply
+^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+| data   | 16     | variable |
++--------+--------+----------+
+
+* *offset* into the region accessed.
+* *region* is the index of the region accessed.
+* *count* is the size of the data transferred.
+* *data* is the data that was read from the device region.
+
+``VFIO_USER_REGION_WRITE``
+--------------------------
+
+If a device region is not mappable, it's not directly accessible by the client
+via mmap() of the underlying fd. In this case, a client can write to a device
+region with this message.
+
+Request
+^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+| data   | 16     | variable |
++--------+--------+----------+
+
+* *offset* into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred.
+* *data* is the data to write
+
+Reply
+^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+
+* *offset* into the region accessed.
+* *region* is the index of the region accessed.
+* *count* is the size of the data transferred.
+
+``VFIO_USER_DMA_READ``
+-----------------------
+
+If the client has not shared mappable memory, the server can use this message to
+read from guest memory.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed. This address must have
+  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
+* *count* is the size of the data to be transferred.
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+| data    | 16     | variable |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed.
+* *count* is the size of the data transferred.
+* *data* is the data read.
+
+``VFIO_USER_DMA_WRITE``
+-----------------------
+
+If the client has not shared mappable memory, the server can use this message to
+write to guest memory.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+| data    | 16     | variable |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed. This address must have
+  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
+* *count* is the size of the data to be transferred.
+* *data* is the data to write
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 4        |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed.
+* *count* is the size of the data transferred.
+
+``VFIO_USER_DEVICE_RESET``
+--------------------------
+
+This command message is sent from the client to the server to reset the device.
+Neither the request or reply have a payload.
+
+``VFIO_USER_DIRTY_PAGES``
+-------------------------
+
+This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``. It is sent by the client
+to the server in order to control logging of dirty pages, usually during a live
+migration.
+
+Dirty page tracking is optional for server implementation; clients should not
+rely on it.
+
+Request
+^^^^^^^
+
++-------+--------+-----------------------------------------+
+| Name  | Offset | Size                                    |
++=======+========+=========================================+
+| argsz | 0      | 4                                       |
++-------+--------+-----------------------------------------+
+| flags | 4      | 4                                       |
++-------+--------+-----------------------------------------+
+|       | +-----+----------------------------------------+ |
+|       | | Bit | Definition                             | |
+|       | +=====+========================================+ |
+|       | | 0   | VFIO_IOMMU_DIRTY_PAGES_FLAG_START      | |
+|       | +-----+----------------------------------------+ |
+|       | | 1   | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP       | |
+|       | +-----+----------------------------------------+ |
+|       | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
+|       | +-----+----------------------------------------+ |
++-------+--------+-----------------------------------------+
+
+* *argsz* is the size of the VFIO dirty bitmap info structure for
+  ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply payload
+
+* *flags* defines the action to be performed by the server:
+
+  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start logging
+    pages it dirties. Logging continues until explicitly disabled by
+    ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``.
+
+  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logging
+    dirty pages.
+
+  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to return
+    the dirty bitmap for a specific IOVA range. The IOVA range is specified by
+    a "VFIO Bitmap Range" structure, which must immediately follow this
+    "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_.
+    This operation is only valid if logging of dirty pages has been previously
+    started.
+
+  These flags are mutually exclusive with each other.
+
+This part of the request is analogous to VFIO's ``struct
+vfio_iommu_type1_dirty_bitmap``.
+
+.. _VFIO Bitmap Range Format:
+
+VFIO Bitmap Range Format
+""""""""""""""""""""""""
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| iova   | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+| bitmap | 16     | 24   |
++--------+--------+------+
+
+* *iova* is the IOVA offset
+
+* *size* is the size of the IOVA region
+
+* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_.
+
+This part of the request is analogous to VFIO's ``struct
+vfio_iommu_type1_dirty_bitmap_get``.
+
+Reply
+^^^^^
+
+For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or
+``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload.
+
+For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as follows:
+
++--------------+--------+-----------------------------------------+
+| Name         | Offset | Size                                    |
++==============+========+=========================================+
+| argsz        | 0      | 4                                       |
++--------------+--------+-----------------------------------------+
+| flags        | 4      | 4                                       |
++--------------+--------+-----------------------------------------+
+|              | +-----+----------------------------------------+ |
+|              | | Bit | Definition                             | |
+|              | +=====+========================================+ |
+|              | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
+|              | +-----+----------------------------------------+ |
++--------------+--------+-----------------------------------------+
+| bitmap range | 8      | 40                                      |
++--------------+--------+-----------------------------------------+
+| bitmap       | 48     | variable                                |
++--------------+--------+-----------------------------------------+
+
+* *argsz* is the size required for the full reply payload (dirty pages structure
+  + bitmap range structure + actual bitmap)
+* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``
+* *bitmap range* is the same bitmap range struct provided in the request, as
+  defined in `VFIO Bitmap Range Format`_.
+* *bitmap* is the actual dirty pages bitmap corresponding to the range request
+
+VFIO Device Migration Info
+--------------------------
+
+A device may contain a migration region (of type
+``VFIO_REGION_TYPE_MIGRATION``).  The beginning of the region must contain
+``struct vfio_device_migration_info``, defined in ``<linux/vfio.h>``. This
+subregion is accessed like any other part of a standard vfio-user region
+using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``.
+
++---------------+--------+-----------------------------+
+| Name          | Offset | Size                        |
++===============+========+=============================+
+| device_state  | 0      | 4                           |
++---------------+--------+-----------------------------+
+|               | +-----+----------------------------+ |
+|               | | Bit | Definition                 | |
+|               | +=====+============================+ |
+|               | | 0   | VFIO_DEVICE_STATE_RUNNING  | |
+|               | +-----+----------------------------+ |
+|               | | 1   | VFIO_DEVICE_STATE_SAVING   | |
+|               | +-----+----------------------------+ |
+|               | | 2   | VFIO_DEVICE_STATE_RESUMING | |
+|               | +-----+----------------------------+ |
++---------------+--------+-----------------------------+
+| reserved      | 4      | 4                           |
++---------------+--------+-----------------------------+
+| pending_bytes | 8      | 8                           |
++---------------+--------+-----------------------------+
+| data_offset   | 16     | 8                           |
++---------------+--------+-----------------------------+
+| data_size     | 24     | 8                           |
++---------------+--------+-----------------------------+
+
+* *device_state* defines the state of the device:
+
+  The client initiates device state transition by writing the intended state.
+  The server must respond only after it has successfully transitioned to the new
+  state. If an error occurs then the server must respond to the
+  ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingly and
+  must remain at the previous state, or in case of internal error it must
+  transition to the error state, defined as
+  ``VFIO_DEVICE_STATE_RESUMING | VFIO_DEVICE_STATE_SAVING``. The client must
+  re-read the device state in order to determine it afresh.
+
+  The following device states are defined:
+
+  +-----------+---------+----------+-----------------------------------+
+  | _RESUMING | _SAVING | _RUNNING | Description                       |
+  +===========+=========+==========+===================================+
+  | 0         | 0       | 0        | Device is stopped.                |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 0       | 1        | Device is running, default state. |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 1       | 0        | Stop-and-copy state               |
+  +-----------+---------+----------+-----------------------------------+
+  | 0         | 1       | 1        | Pre-copy state                    |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 0       | 0        | Resuming                          |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 0       | 1        | Invalid state                     |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 1       | 0        | Error state                       |
+  +-----------+---------+----------+-----------------------------------+
+  | 1         | 1       | 1        | Invalid state                     |
+  +-----------+---------+----------+-----------------------------------+
+
+  Valid state transitions are shown in the following table:
+
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming |
+  +=========================+=========+=========+===============+==========+==========+
+  | Stopped                 |    \-   |    1    |       0       |    0     |     0    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Running                 |    1    |    \-   |       1       |    1     |     1    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Stop-and-copy           |    1    |    1    |       \-      |    0     |     0    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Pre-copy                |    0    |    0    |       1       |    \-    |     0    |
+  +-------------------------+---------+---------+---------------+----------+----------+
+  | Resuming                |    0    |    1    |       0       |    0     |     \-   |
+  +-------------------------+---------+---------+---------------+----------+----------+
+
+  A device is migrated to the destination as follows:
+
+  * The source client transitions the device state from the running state to
+    the pre-copy state. This transition is optional for the client but must be
+    supported by the server. The source server starts sending device state data
+    to the source client through the migration region while the device is
+    running.
+
+  * The source client transitions the device state from the running state or the
+    pre-copy state to the stop-and-copy state. The source server stops the
+    device, saves device state and sends it to the source client through the
+    migration region.
+
+  The source client is responsible for sending the migration data to the
+  destination client.
+
+  A device is resumed on the destination as follows:
+
+  * The destination client transitions the device state from the running state
+    to the resuming state. The destination server uses the device state data
+    received through the migration region to resume the device.
+
+  * The destination client provides saved device state to the destination
+    server and then transitions the device to back to the running state.
+
+* *reserved* This field is reserved and any access to it must be ignored by the
+  server.
+
+* *pending_bytes* Remaining bytes to be migrated by the server. This field is
+  read only.
+
+* *data_offset* Offset in the migration region where the client must:
+
+  * read from, during the pre-copy or stop-and-copy state, or
+
+  * write to, during the resuming state.
+
+  This field is read only.
+
+* *data_size* Contains the size, in bytes, of the amount of data copied to:
+
+  * the source migration region by the source server during the pre-copy or
+    stop-and copy state, or
+
+  * the destination migration region by the destination client during the
+    resuming state.
+
+Device-specific data must be stored at any position after
+``struct vfio_device_migration_info``. Note that the migration region can be
+memory mappable, even partially. In practise, only the migration data portion
+can be memory mapped.
+
+The client processes device state data during the pre-copy and the
+stop-and-copy state in the following iterative manner:
+
+  1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads
+     of this field is an idempotent operation. If there are no migration data
+     to be consumed then the next step depends on the current device state:
+
+     * pre-copy: the client must try again.
+
+     * stop-and-copy: this procedure can end and the device can now start
+       resuming on the destination.
+
+  2. The client reads ``data_offset``; at this point the server must make
+     available a portion of migration data at this offset to be read by the
+     client, which must happen *before* completing the read operation. The
+     amount of data to be read must be stored in the ``data_size`` field, which
+     the client reads next.
+
+  3. The client reads ``data_size`` to determine the amount of migration data
+     available.
+
+  4. The client reads and processes the migration data.
+
+  5. Go to step 1.
+
+Note that the client can transition the device from the pre-copy state to the
+stop-and-copy state at any time; ``pending_bytes`` does not need to become zero.
+
+The client initializes the device state on the destination by setting the
+device state in the resuming state and writing the migration data to the
+destination migration region at ``data_offset`` offset. The client can write the
+source migration data in an iterative manner and the server must consume this
+data before completing each write operation, updating the ``data_offset`` field.
+The server must apply the source migration data on the device resume state. The
+client must write data on the same order and transaction size as read.
+
+If an error occurs then the server must fail the read or write operation. It is
+an implementation detail of the client how to handle errors.
+
+Appendices
+==========
+
+Unused VFIO ``ioctl()`` commands
+--------------------------------
+
+The following VFIO commands do not have an equivalent vfio-user command:
+
+* ``VFIO_GET_API_VERSION``
+* ``VFIO_CHECK_EXTENSION``
+* ``VFIO_SET_IOMMU``
+* ``VFIO_GROUP_GET_STATUS``
+* ``VFIO_GROUP_SET_CONTAINER``
+* ``VFIO_GROUP_UNSET_CONTAINER``
+* ``VFIO_GROUP_GET_DEVICE_FD``
+* ``VFIO_IOMMU_GET_INFO``
+
+However, once support for live migration for VFIO devices is finalized some
+of the above commands may have to be handled by the client in their
+corresponding vfio-user form. This will be addressed in a future protocol
+version.
+
+VFIO groups and containers
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current VFIO implementation includes group and container idioms that
+describe how a device relates to the host IOMMU. In the vfio-user
+implementation, the IOMMU is implemented in SW by the client, and is not
+visible to the server. The simplest idea would be that the client put each
+device into its own group and container.
+
+Backend Program Conventions
+---------------------------
+
+vfio-user backend program conventions are based on the vhost-user ones.
+
+* The backend program must not daemonize itself.
+* No assumptions must be made as to what access the backend program has on the
+  system.
+* File descriptors 0, 1 and 2 must exist, must have regular
+  stdin/stdout/stderr semantics, and can be redirected.
+* The backend program must honor the SIGTERM signal.
+* The backend program must accept the following commands line options:
+
+  * ``--socket-path=PATH``: path to UNIX domain socket,
+  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
+    ``--socket-path``
+* The backend program must be accompanied with a JSON file stored under
+  ``/usr/share/vfio-user``.
+
+TODO add schema similar to docs/interop/vhost-user.json.
diff --git a/MAINTAINERS b/MAINTAINERS
index 694973ed23..d838b9e3f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1879,6 +1879,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s390x@nongnu.org
 
+vfio-user
+M: John G Johnson <john.g.johnson@oracle.com>
+M: Thanos Makatos <thanos.makatos@nutanix.com>
+S: Supported
+F: docs/devel/vfio-user.rst
+
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 02/16] vfio-user: add VFIO base abstract class
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
  2021-08-16 16:42 ` [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-16 16:42 ` [PATCH RFC v2 03/16] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Add an abstract base class both the kernel driver
and user socket implementations can use to share code.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.h | 16 +++++++++++--
 hw/vfio/pci.c | 63 ++++++++++++++++++++++++++++++++-------------------
 2 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 64777516d1..bbc78aaeb3 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -114,8 +114,13 @@ typedef struct VFIOMSIXInfo {
     unsigned long *pending;
 } VFIOMSIXInfo;
 
-#define TYPE_VFIO_PCI "vfio-pci"
-OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
+/*
+ * TYPE_VFIO_PCI_BASE is an abstract type used to share code
+ * between VFIO implementations that use a kernel driver
+ * with those that use user sockets.
+ */
+#define TYPE_VFIO_PCI_BASE "vfio-pci-base"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI_BASE)
 
 struct VFIOPCIDevice {
     PCIDevice pdev;
@@ -175,6 +180,13 @@ struct VFIOPCIDevice {
     Notifier irqchip_change_notifier;
 };
 
+#define TYPE_VFIO_PCI "vfio-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOKernPCIDevice, VFIO_PCI)
+
+struct VFIOKernPCIDevice {
+    VFIOPCIDevice device;
+};
+
 /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
 static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, uint32_t device)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..bea95efc33 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -231,7 +231,7 @@ static void vfio_intx_update(VFIOPCIDevice *vdev, PCIINTxRoute *route)
 
 static void vfio_intx_routing_notifier(PCIDevice *pdev)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     PCIINTxRoute route;
 
     if (vdev->interrupt != VFIO_INT_INTx) {
@@ -457,7 +457,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIOMSIVector *vector;
     int ret;
 
@@ -542,7 +542,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
 
 static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
 
     trace_vfio_msix_vector_release(vdev->vbasedev.name, nr);
@@ -1063,7 +1063,7 @@ static const MemoryRegionOps vfio_vga_ops = {
  */
 static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIORegion *region = &vdev->bars[bar].region;
     MemoryRegion *mmap_mr, *region_mr, *base_mr;
     PCIIORegion *r;
@@ -1109,7 +1109,7 @@ static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
  */
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
 
     memcpy(&emu_bits, vdev->emulated_config_bits + addr, len);
@@ -1142,7 +1142,7 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 void vfio_pci_write_config(PCIDevice *pdev,
                            uint32_t addr, uint32_t val, int len)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     uint32_t val_le = cpu_to_le32(val);
 
     trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
@@ -2782,7 +2782,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
 
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -3105,7 +3105,7 @@ error:
 
 static void vfio_instance_finalize(Object *obj)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
     VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_display_finalize(vdev);
@@ -3125,7 +3125,7 @@ static void vfio_instance_finalize(Object *obj)
 
 static void vfio_exitfn(PCIDevice *pdev)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
@@ -3144,7 +3144,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
 static void vfio_pci_reset(DeviceState *dev)
 {
-    VFIOPCIDevice *vdev = VFIO_PCI(dev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
 
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
@@ -3184,7 +3184,7 @@ post_reset:
 static void vfio_instance_init(Object *obj)
 {
     PCIDevice *pci_dev = PCI_DEVICE(obj);
-    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
 
     device_add_bootindex_property(obj, &vdev->bootindex,
                                   "bootindex", NULL,
@@ -3253,28 +3253,24 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+static void vfio_pci_base_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
 
-    dc->reset = vfio_pci_reset;
-    device_class_set_props(dc, vfio_pci_dev_properties);
-    dc->desc = "VFIO-based PCI device assignment";
+    dc->desc = "VFIO PCI base device";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
-    pdc->realize = vfio_realize;
     pdc->exit = vfio_exitfn;
     pdc->config_read = vfio_pci_read_config;
     pdc->config_write = vfio_pci_write_config;
 }
 
-static const TypeInfo vfio_pci_dev_info = {
-    .name = TYPE_VFIO_PCI,
+static const TypeInfo vfio_pci_base_dev_info = {
+    .name = TYPE_VFIO_PCI_BASE,
     .parent = TYPE_PCI_DEVICE,
-    .instance_size = sizeof(VFIOPCIDevice),
-    .class_init = vfio_pci_dev_class_init,
-    .instance_init = vfio_instance_init,
-    .instance_finalize = vfio_instance_finalize,
+    .instance_size = 0,
+    .abstract = true,
+    .class_init = vfio_pci_base_dev_class_init,
     .interfaces = (InterfaceInfo[]) {
         { INTERFACE_PCIE_DEVICE },
         { INTERFACE_CONVENTIONAL_PCI_DEVICE },
@@ -3282,6 +3278,26 @@ static const TypeInfo vfio_pci_dev_info = {
     },
 };
 
+static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+    dc->reset = vfio_pci_reset;
+    device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->desc = "VFIO-based PCI device assignment";
+    pdc->realize = vfio_realize;
+}
+
+static const TypeInfo vfio_pci_dev_info = {
+    .name = TYPE_VFIO_PCI,
+    .parent = TYPE_VFIO_PCI_BASE,
+    .instance_size = sizeof(VFIOKernPCIDevice),
+    .class_init = vfio_pci_dev_class_init,
+    .instance_init = vfio_instance_init,
+    .instance_finalize = vfio_instance_finalize,
+};
+
 static Property vfio_pci_dev_nohotplug_properties[] = {
     DEFINE_PROP_BOOL("ramfb", VFIOPCIDevice, enable_ramfb, false),
     DEFINE_PROP_END_OF_LIST(),
@@ -3298,12 +3314,13 @@ static void vfio_pci_nohotplug_dev_class_init(ObjectClass *klass, void *data)
 static const TypeInfo vfio_pci_nohotplug_dev_info = {
     .name = TYPE_VFIO_PCI_NOHOTPLUG,
     .parent = TYPE_VFIO_PCI,
-    .instance_size = sizeof(VFIOPCIDevice),
+    .instance_size = sizeof(VFIOKernPCIDevice),
     .class_init = vfio_pci_nohotplug_dev_class_init,
 };
 
 static void register_vfio_pci_dev_type(void)
 {
+    type_register_static(&vfio_pci_base_dev_info);
     type_register_static(&vfio_pci_dev_info);
     type_register_static(&vfio_pci_nohotplug_dev_info);
 }
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 03/16] vfio-user: Define type vfio_user_pci_dev_info
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
  2021-08-16 16:42 ` [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
  2021-08-16 16:42 ` [PATCH RFC v2 02/16] vfio-user: add VFIO base abstract class Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-24 13:52   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

New class for vfio-user with its class and instance
constructors and destructors, and its pci ops.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.h |  9 ++++++
 hw/vfio/pci.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bbc78aaeb3..08ac6475a4 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -187,6 +187,15 @@ struct VFIOKernPCIDevice {
     VFIOPCIDevice device;
 };
 
+#define TYPE_VFIO_USER_PCI "vfio-user-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, VFIO_USER_PCI)
+
+struct VFIOUserPCIDevice {
+    VFIOPCIDevice device;
+    char *sock_name;
+    bool secure_dma; /* disable shared mem for DMA */
+};
+
 /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
 static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, uint32_t device)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bea95efc33..d642aafb7f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3326,3 +3326,89 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
+
+/*
+ * vfio-user routines.
+ */
+
+/*
+ * Emulated devices don't use host hot reset
+ */
+static int vfio_user_pci_no_reset(VFIODevice *vbasedev)
+{
+    error_printf("vfio-user - no hot reset\n");
+    return 0;
+}
+
+static void vfio_user_pci_not_needed(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_user_pci_ops = {
+    .vfio_compute_needs_reset = vfio_user_pci_not_needed,
+    .vfio_hot_reset_multi = vfio_user_pci_no_reset,
+    .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
+};
+
+static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
+{
+    ERRP_GUARD();
+    VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    if (!udev->sock_name) {
+        error_setg(errp, "No socket specified");
+        error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
+        return;
+    }
+
+    vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
+    vbasedev->dev = DEVICE(vdev);
+    vbasedev->fd = -1;
+    vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+    vbasedev->no_mmap = false;
+    vbasedev->ops = &vfio_user_pci_ops;
+
+}
+
+static void vfio_user_instance_finalize(Object *obj)
+{
+}
+
+static Property vfio_user_pci_dev_properties[] = {
+    DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
+    DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+    device_class_set_props(dc, vfio_user_pci_dev_properties);
+    dc->desc = "VFIO over socket PCI device assignment";
+    pdc->realize = vfio_user_pci_realize;
+}
+
+static const TypeInfo vfio_user_pci_dev_info = {
+    .name = TYPE_VFIO_USER_PCI,
+    .parent = TYPE_VFIO_PCI_BASE,
+    .instance_size = sizeof(VFIOUserPCIDevice),
+    .class_init = vfio_user_pci_dev_class_init,
+    .instance_init = vfio_instance_init,
+    .instance_finalize = vfio_user_instance_finalize,
+};
+
+static void register_vfio_user_dev_type(void)
+{
+    type_register_static(&vfio_user_pci_dev_info);
+}
+
+type_init(register_vfio_user_dev_type)
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (2 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 03/16] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-18 18:47   ` Alex Williamson
  2021-08-24 14:15   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h                |  66 ++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/pci.c                 |  29 ++++++
 hw/vfio/user.c                | 160 ++++++++++++++++++++++++++++++++++
 MAINTAINERS                   |   4 +
 hw/vfio/meson.build           |   1 +
 6 files changed, 262 insertions(+)
 create mode 100644 hw/vfio/user.h
 create mode 100644 hw/vfio/user.c

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
new file mode 100644
index 0000000000..62b2d03d56
--- /dev/null
+++ b/hw/vfio/user.h
@@ -0,0 +1,66 @@
+#ifndef VFIO_USER_H
+#define VFIO_USER_H
+
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+typedef struct {
+    int send_fds;
+    int recv_fds;
+    int *fds;
+} VFIOUserFDs;
+
+typedef struct VFIOUserReply {
+    QTAILQ_ENTRY(VFIOUserReply) next;
+    VFIOUserFDs *fds;
+    uint32_t rsize;
+    uint32_t id;
+    QemuCond cv;
+    bool complete;
+    bool nowait;
+} VFIOUserReply;
+
+
+enum proxy_state {
+    VFIO_PROXY_CONNECTED = 1,
+    VFIO_PROXY_RECV_ERROR = 2,
+    VFIO_PROXY_CLOSING = 3,
+    VFIO_PROXY_CLOSED = 4,
+};
+
+typedef struct VFIOProxy {
+    QLIST_ENTRY(VFIOProxy) next;
+    char *sockname;
+    struct QIOChannel *ioc;
+    int (*request)(void *opaque, char *buf, VFIOUserFDs *fds);
+    void *reqarg;
+    int flags;
+    QemuCond close_cv;
+
+    /*
+     * above only changed when BQL is held
+     * below are protected by per-proxy lock
+     */
+    QemuMutex lock;
+    QTAILQ_HEAD(, VFIOUserReply) free;
+    QTAILQ_HEAD(, VFIOUserReply) pending;
+    VFIOUserReply *last_nowait;
+    enum proxy_state state;
+    bool close_wait;
+} VFIOProxy;
+
+/* VFIOProxy flags */
+#define VFIO_PROXY_CLIENT       0x1
+#define VFIO_PROXY_SECURE       0x2
+
+VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
+void vfio_user_disconnect(VFIOProxy *proxy);
+
+#endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0a76..f43dc6e5d0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -75,6 +75,7 @@ typedef struct VFIOAddressSpace {
 } VFIOAddressSpace;
 
 struct VFIOGroup;
+typedef struct VFIOProxy VFIOProxy;
 
 typedef struct VFIOContainer {
     VFIOAddressSpace *space;
@@ -143,6 +144,7 @@ typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    VFIOProxy *proxy;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d642aafb7f..7c2d245ca5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,6 +42,7 @@
 #include "qapi/error.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
+#include "hw/vfio/user.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -3361,13 +3362,35 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
+    SocketAddress addr;
+    VFIOProxy *proxy;
+    Error *err = NULL;
 
+    /*
+     * TODO: make option parser understand SocketAddress
+     * and use that instead of having scaler options
+     * for each socket type.
+     */
     if (!udev->sock_name) {
         error_setg(errp, "No socket specified");
         error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
         return;
     }
 
+    memset(&addr, 0, sizeof(addr));
+    addr.type = SOCKET_ADDRESS_TYPE_UNIX;
+    addr.u.q_unix.path = udev->sock_name;
+    proxy = vfio_user_connect_dev(&addr, &err);
+    if (!proxy) {
+        error_setg(errp, "Remote proxy not found");
+        return;
+    }
+    vbasedev->proxy = proxy;
+
+    if (udev->secure_dma) {
+        proxy->flags |= VFIO_PROXY_SECURE;
+    }
+
     vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
     vbasedev->dev = DEVICE(vdev);
     vbasedev->fd = -1;
@@ -3379,6 +3402,12 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
 
 static void vfio_user_instance_finalize(Object *obj)
 {
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    vfio_put_device(vdev);
+
+    vfio_user_disconnect(vbasedev->proxy);
 }
 
 static Property vfio_user_pci_dev_properties[] = {
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
new file mode 100644
index 0000000000..3bd304e036
--- /dev/null
+++ b/hw/vfio/user.c
@@ -0,0 +1,160 @@
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "qemu/main-loop.h"
+#include "hw/hw.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "qemu/sockets.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "io/channel-util.h"
+#include "sysemu/iothread.h"
+#include "user.h"
+
+static IOThread *vfio_user_iothread;
+static void vfio_user_shutdown(VFIOProxy *proxy);
+
+
+/*
+ * Functions called by main, CPU, or iothread threads
+ */
+
+static void vfio_user_shutdown(VFIOProxy *proxy)
+{
+    qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
+}
+
+
+/*
+ * Functions only called by iothread
+ */
+
+static void vfio_user_cb(void *opaque)
+{
+    VFIOProxy *proxy = opaque;
+
+    qemu_mutex_lock(&proxy->lock);
+    proxy->state = VFIO_PROXY_CLOSED;
+    qemu_mutex_unlock(&proxy->lock);
+    qemu_cond_signal(&proxy->close_cv);
+}
+
+
+/*
+ * Functions called by main or CPU threads
+ */
+
+static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
+    QLIST_HEAD_INITIALIZER(vfio_user_sockets);
+
+VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
+{
+    VFIOProxy *proxy;
+    QIOChannelSocket *sioc;
+    QIOChannel *ioc;
+    char *sockname;
+
+    if (addr->type != SOCKET_ADDRESS_TYPE_UNIX) {
+        error_setg(errp, "vfio_user_connect - bad address family");
+        return NULL;
+    }
+    sockname = addr->u.q_unix.path;
+
+    sioc = qio_channel_socket_new();
+    ioc = QIO_CHANNEL(sioc);
+    if (qio_channel_socket_connect_sync(sioc, addr, errp)) {
+        object_unref(OBJECT(ioc));
+        return NULL;
+    }
+    qio_channel_set_blocking(ioc, true, NULL);
+
+    proxy = g_malloc0(sizeof(VFIOProxy));
+    proxy->sockname = sockname;
+    proxy->ioc = ioc;
+    proxy->flags = VFIO_PROXY_CLIENT;
+    proxy->state = VFIO_PROXY_CONNECTED;
+    qemu_cond_init(&proxy->close_cv);
+
+    if (vfio_user_iothread == NULL) {
+        vfio_user_iothread = iothread_create("VFIO user", errp);
+    }
+
+    qemu_mutex_init(&proxy->lock);
+    QTAILQ_INIT(&proxy->free);
+    QTAILQ_INIT(&proxy->pending);
+    QLIST_INSERT_HEAD(&vfio_user_sockets, proxy, next);
+
+    return proxy;
+}
+
+void vfio_user_disconnect(VFIOProxy *proxy)
+{
+    VFIOUserReply *r1, *r2;
+
+    qemu_mutex_lock(&proxy->lock);
+
+    /* our side is quitting */
+    if (proxy->state == VFIO_PROXY_CONNECTED) {
+        vfio_user_shutdown(proxy);
+        if (!QTAILQ_EMPTY(&proxy->pending)) {
+            error_printf("vfio_user_disconnect: outstanding requests\n");
+        }
+    }
+    object_unref(OBJECT(proxy->ioc));
+    proxy->ioc = NULL;
+
+    proxy->state = VFIO_PROXY_CLOSING;
+    QTAILQ_FOREACH_SAFE(r1, &proxy->pending, next, r2) {
+        qemu_cond_destroy(&r1->cv);
+        QTAILQ_REMOVE(&proxy->pending, r1, next);
+        g_free(r1);
+    }
+    QTAILQ_FOREACH_SAFE(r1, &proxy->free, next, r2) {
+        qemu_cond_destroy(&r1->cv);
+        QTAILQ_REMOVE(&proxy->free, r1, next);
+        g_free(r1);
+    }
+
+    /*
+     * Make sure the iothread isn't blocking anywhere
+     * with a ref to this proxy by waiting for a BH
+     * handler to run after the proxy fd handlers were
+     * deleted above.
+     */
+    proxy->close_wait = 1;
+    aio_bh_schedule_oneshot(iothread_get_aio_context(vfio_user_iothread),
+                            vfio_user_cb, proxy);
+
+    /* drop locks so the iothread can make progress */
+    qemu_mutex_unlock_iothread();
+    qemu_cond_wait(&proxy->close_cv, &proxy->lock);
+
+    /* we now hold the only ref to proxy */
+    qemu_mutex_unlock(&proxy->lock);
+    qemu_cond_destroy(&proxy->close_cv);
+    qemu_mutex_destroy(&proxy->lock);
+
+    qemu_mutex_lock_iothread();
+
+    QLIST_REMOVE(proxy, next);
+    if (QLIST_EMPTY(&vfio_user_sockets)) {
+        iothread_destroy(vfio_user_iothread);
+        vfio_user_iothread = NULL;
+    }
+
+    g_free(proxy);
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index d838b9e3f2..f429bab391 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1882,8 +1882,12 @@ L: qemu-s390x@nongnu.org
 vfio-user
 M: John G Johnson <john.g.johnson@oracle.com>
 M: Thanos Makatos <thanos.makatos@nutanix.com>
+M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
+M: Jagannathan Raman <jag.raman@oracle.com>
 S: Supported
 F: docs/devel/vfio-user.rst
+F: hw/vfio/user.c
+F: hw/vfio/user.h
 
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af297a0..739b30be73 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -8,6 +8,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
   'pci-quirks.c',
   'pci.c',
+  'user.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
 vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (3 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-24 15:14   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server Elena Ufimtseva
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h |  62 +++++++++
 hw/vfio/user.h          |   8 ++
 hw/vfio/pci.c           |   6 +
 hw/vfio/user.c          | 289 ++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS             |   1 +
 5 files changed, 366 insertions(+)
 create mode 100644 hw/vfio/user-protocol.h

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
new file mode 100644
index 0000000000..27062cb910
--- /dev/null
+++ b/hw/vfio/user-protocol.h
@@ -0,0 +1,62 @@
+#ifndef VFIO_USER_PROTOCOL_H
+#define VFIO_USER_PROTOCOL_H
+
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Each message has a standard header that describes the command
+ * being sent, which is almost always a VFIO ioctl().
+ *
+ * The header may be followed by command-specific data, such as the
+ * region and offset info for read and write commands.
+ */
+
+typedef struct {
+    uint16_t id;
+    uint16_t command;
+    uint32_t size;
+    uint32_t flags;
+    uint32_t error_reply;
+} VFIOUserHdr;
+
+/* VFIOUserHdr commands */
+enum vfio_user_command {
+    VFIO_USER_VERSION                   = 1,
+    VFIO_USER_DMA_MAP                   = 2,
+    VFIO_USER_DMA_UNMAP                 = 3,
+    VFIO_USER_DEVICE_GET_INFO           = 4,
+    VFIO_USER_DEVICE_GET_REGION_INFO    = 5,
+    VFIO_USER_DEVICE_GET_REGION_IO_FDS  = 6,
+    VFIO_USER_DEVICE_GET_IRQ_INFO       = 7,
+    VFIO_USER_DEVICE_SET_IRQS           = 8,
+    VFIO_USER_REGION_READ               = 9,
+    VFIO_USER_REGION_WRITE              = 10,
+    VFIO_USER_DMA_READ                  = 11,
+    VFIO_USER_DMA_WRITE                 = 12,
+    VFIO_USER_DEVICE_RESET              = 13,
+    VFIO_USER_DIRTY_PAGES               = 14,
+    VFIO_USER_MAX,
+};
+
+/* VFIOUserHdr flags */
+#define VFIO_USER_REQUEST       0x0
+#define VFIO_USER_REPLY         0x1
+#define VFIO_USER_TYPE          0xF
+
+#define VFIO_USER_NO_REPLY      0x10
+#define VFIO_USER_ERROR         0x20
+
+
+#define VFIO_USER_DEF_MAX_FDS   8
+#define VFIO_USER_MAX_MAX_FDS   16
+
+#define VFIO_USER_DEF_MAX_XFER  (1024 * 1024)
+#define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)
+
+
+#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 62b2d03d56..905e374e12 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -11,6 +11,8 @@
  *
  */
 
+#include "user-protocol.h"
+
 typedef struct {
     int send_fds;
     int recv_fds;
@@ -19,6 +21,7 @@ typedef struct {
 
 typedef struct VFIOUserReply {
     QTAILQ_ENTRY(VFIOUserReply) next;
+    VFIOUserHdr *msg;
     VFIOUserFDs *fds;
     uint32_t rsize;
     uint32_t id;
@@ -62,5 +65,10 @@ typedef struct VFIOProxy {
 
 VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
 void vfio_user_disconnect(VFIOProxy *proxy);
+void vfio_user_set_reqhandler(VFIODevice *vbasdev,
+                              int (*handler)(void *opaque, char *buf,
+                                             VFIOUserFDs *fds),
+                                             void *reqarg);
+void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c2d245ca5..7005d9f891 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3333,6 +3333,11 @@ type_init(register_vfio_pci_dev_type)
  * vfio-user routines.
  */
 
+static int vfio_user_pci_process_req(void *opaque, char *buf, VFIOUserFDs *fds)
+{
+    return 0;
+}
+
 /*
  * Emulated devices don't use host hot reset
  */
@@ -3386,6 +3391,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         return;
     }
     vbasedev->proxy = proxy;
+    vfio_user_set_reqhandler(vbasedev, vfio_user_pci_process_req, vdev);
 
     if (udev->secure_dma) {
         proxy->flags |= VFIO_PROXY_SECURE;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 3bd304e036..2fcc77d997 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -25,8 +25,15 @@
 #include "sysemu/iothread.h"
 #include "user.h"
 
+static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
 static IOThread *vfio_user_iothread;
+
 static void vfio_user_shutdown(VFIOProxy *proxy);
+static void vfio_user_recv(void *opaque);
+static void vfio_user_send_locked(VFIOProxy *proxy, VFIOUserHdr *msg,
+                                  VFIOUserFDs *fds);
+static void vfio_user_send(VFIOProxy *proxy, VFIOUserHdr *msg,
+                           VFIOUserFDs *fds);
 
 
 /*
@@ -36,6 +43,67 @@ static void vfio_user_shutdown(VFIOProxy *proxy);
 static void vfio_user_shutdown(VFIOProxy *proxy)
 {
     qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
+    qio_channel_set_aio_fd_handler(proxy->ioc,
+                                   iothread_get_aio_context(vfio_user_iothread),
+                                   NULL, NULL, NULL);
+}
+
+static void vfio_user_send_locked(VFIOProxy *proxy, VFIOUserHdr *msg,
+                                  VFIOUserFDs *fds)
+{
+    struct iovec iov = {
+        .iov_base = msg,
+        .iov_len = msg->size,
+    };
+    size_t numfds = 0;
+    int msgleft, ret, *fdp = NULL;
+    char *buf;
+    Error *local_err = NULL;
+
+    if (proxy->state != VFIO_PROXY_CONNECTED) {
+        msg->flags |= VFIO_USER_ERROR;
+        msg->error_reply = ECONNRESET;
+        return;
+    }
+
+    if (fds != NULL && fds->send_fds != 0) {
+        numfds = fds->send_fds;
+        fdp = fds->fds;
+    }
+
+    ret = qio_channel_writev_full(proxy->ioc, &iov, 1, fdp, numfds, &local_err);
+    if (ret < 0) {
+        goto err;
+    }
+    if (ret == msg->size) {
+        return;
+    }
+
+    buf = iov.iov_base + ret;
+    msgleft = iov.iov_len - ret;
+    do {
+        ret = qio_channel_write(proxy->ioc, buf, msgleft, &local_err);
+        if (ret < 0) {
+            goto err;
+        }
+        buf += ret;
+        msgleft -= ret;
+    } while (msgleft != 0);
+    return;
+
+err:
+    msg->flags |= VFIO_USER_ERROR;
+    msg->error_reply = EIO;
+    error_report_err(local_err);
+}
+
+static void vfio_user_send(VFIOProxy *proxy, VFIOUserHdr *msg,
+                           VFIOUserFDs *fds)
+{
+
+    qemu_mutex_lock(&proxy->lock);
+    vfio_user_send_locked(proxy, msg, fds);
+    qemu_mutex_unlock(&proxy->lock);
 }
 
 
@@ -43,6 +111,213 @@ static void vfio_user_shutdown(VFIOProxy *proxy)
  * Functions only called by iothread
  */
 
+void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret)
+{
+    VFIOUserHdr *hdr = (VFIOUserHdr *)buf;
+
+    /*
+     * convert header to associated reply
+     * positive ret is reply size, negative is error code
+     */
+    hdr->flags = VFIO_USER_REPLY;
+    if (ret >= sizeof(VFIOUserHdr)) {
+        hdr->size = ret;
+    } else if (ret < 0) {
+        hdr->flags |= VFIO_USER_ERROR;
+        hdr->error_reply = -ret;
+        hdr->size = sizeof(*hdr);
+    } else {
+        error_printf("vfio_user_send_reply - size too small\n");
+        return;
+    }
+    vfio_user_send(proxy, hdr, NULL);
+}
+
+void vfio_user_recv(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOProxy *proxy = vbasedev->proxy;
+    VFIOUserReply *reply = NULL;
+    g_autofree int *fdp = NULL;
+    VFIOUserFDs reqfds = { 0, 0, fdp };
+    VFIOUserHdr msg;
+    struct iovec iov = {
+        .iov_base = &msg,
+        .iov_len = sizeof(msg),
+    };
+    bool isreply;
+    int i, ret;
+    size_t msgleft, numfds = 0;
+    char *data = NULL;
+    g_autofree char *buf = NULL;
+    Error *local_err = NULL;
+
+    qemu_mutex_lock(&proxy->lock);
+    if (proxy->state == VFIO_PROXY_CLOSING) {
+        qemu_mutex_unlock(&proxy->lock);
+        return;
+    }
+
+    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
+                                 &local_err);
+    if (ret <= 0) {
+        /* read error or other side closed connection */
+        goto fatal;
+    }
+
+    if (ret < sizeof(msg)) {
+        error_setg(&local_err, "vfio_user_recv short read of header");
+        goto err;
+    }
+    if (msg.size < sizeof(VFIOUserHdr)) {
+        error_setg(&local_err, "vfio_user_recv bad header size");
+        goto err;
+    }
+
+    /*
+     * For replies, find the matching pending request
+     */
+    switch (msg.flags & VFIO_USER_TYPE) {
+    case VFIO_USER_REQUEST:
+        isreply = 0;
+        break;
+    case VFIO_USER_REPLY:
+        isreply = 1;
+        break;
+    default:
+        error_setg(&local_err, "vfio_user_recv unknown message type");
+        goto err;
+    }
+
+    if (isreply) {
+        QTAILQ_FOREACH(reply, &proxy->pending, next) {
+            if (msg.id == reply->id) {
+                break;
+            }
+        }
+        if (reply == NULL) {
+            error_setg(&local_err, "vfio_user_recv unexpected reply");
+            goto err;
+        }
+        QTAILQ_REMOVE(&proxy->pending, reply, next);
+
+        /*
+         * Process any received FDs
+         */
+        if (numfds != 0) {
+            if (reply->fds == NULL || reply->fds->recv_fds < numfds) {
+                error_setg(&local_err, "vfio_user_recv unexpected FDs");
+                goto err;
+            }
+            reply->fds->recv_fds = numfds;
+            memcpy(reply->fds->fds, fdp, numfds * sizeof(int));
+        }
+
+    } else {
+        /*
+         * The client doesn't expect any FDs in requests, but
+         * they will be expected on the server
+         */
+        if (numfds != 0 && (proxy->flags & VFIO_PROXY_CLIENT)) {
+            error_setg(&local_err, "vfio_user_recv fd in client reply");
+            goto err;
+        }
+        reqfds.recv_fds = numfds;
+    }
+
+    /*
+     * put the whole message into a single buffer
+     */
+    if (isreply) {
+        if (msg.size > reply->rsize) {
+            error_setg(&local_err,
+                       "vfio_user_recv reply larger than recv buffer");
+            goto fatal;
+        }
+        *reply->msg = msg;
+        data = (char *)reply->msg + sizeof(msg);
+    } else {
+        if (msg.size > max_xfer_size) {
+            error_setg(&local_err, "vfio_user_recv request larger than max");
+            goto fatal;
+        }
+        buf = g_malloc0(msg.size);
+        memcpy(buf, &msg, sizeof(msg));
+        data = buf + sizeof(msg);
+    }
+
+    msgleft = msg.size - sizeof(msg);
+    if (msgleft != 0) {
+        ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
+        if (ret < 0) {
+            goto fatal;
+        }
+        if (ret != msgleft) {
+            error_setg(&local_err, "vfio_user_recv short read of msg body");
+            goto err;
+        }
+    }
+
+    /*
+     * Replies signal a waiter, requests get processed by vfio code
+     * that may assume the iothread lock is held.
+     */
+    if (isreply) {
+        reply->complete = 1;
+        if (!reply->nowait) {
+            qemu_cond_signal(&reply->cv);
+        } else {
+            if (msg.flags & VFIO_USER_ERROR) {
+                error_printf("vfio_user_rcv error reply on async request ");
+                error_printf("command %x error %s\n", msg.command,
+                             strerror(msg.error_reply));
+            }
+            /* just free it if no one is waiting */
+            reply->nowait = 0;
+            if (proxy->last_nowait == reply) {
+                proxy->last_nowait = NULL;
+            }
+            g_free(reply->msg);
+            QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
+        }
+        qemu_mutex_unlock(&proxy->lock);
+    } else {
+        qemu_mutex_unlock(&proxy->lock);
+        qemu_mutex_lock_iothread();
+        /*
+         * make sure proxy wasn't closed while we waited
+         * checking state without holding the proxy lock is safe
+         * since it's only set to CLOSING when BQL is held
+         */
+        if (proxy->state != VFIO_PROXY_CLOSING) {
+            ret = proxy->request(proxy->reqarg, buf, &reqfds);
+            if (ret < 0 && !(msg.flags & VFIO_USER_NO_REPLY)) {
+                vfio_user_send_reply(proxy, buf, ret);
+            }
+        }
+        qemu_mutex_unlock_iothread();
+    }
+    return;
+
+fatal:
+    vfio_user_shutdown(proxy);
+    proxy->state = VFIO_PROXY_RECV_ERROR;
+
+err:
+    for (i = 0; i < numfds; i++) {
+        close(fdp[i]);
+    }
+    if (reply != NULL) {
+        /* force an error to keep sending thread from hanging */
+        reply->msg->flags |= VFIO_USER_ERROR;
+        reply->msg->error_reply = EINVAL;
+        reply->complete = 1;
+        qemu_cond_signal(&reply->cv);
+    }
+    qemu_mutex_unlock(&proxy->lock);
+    error_report_err(local_err);
+}
+
 static void vfio_user_cb(void *opaque)
 {
     VFIOProxy *proxy = opaque;
@@ -101,6 +376,20 @@ VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
     return proxy;
 }
 
+void vfio_user_set_reqhandler(VFIODevice *vbasedev,
+                              int (*handler)(void *opaque, char *buf,
+                                             VFIOUserFDs *fds),
+                              void *reqarg)
+{
+    VFIOProxy *proxy = vbasedev->proxy;
+
+    proxy->request = handler;
+    proxy->reqarg = reqarg;
+    qio_channel_set_aio_fd_handler(proxy->ioc,
+                                   iothread_get_aio_context(vfio_user_iothread),
+                                   vfio_user_recv, NULL, vbasedev);
+}
+
 void vfio_user_disconnect(VFIOProxy *proxy)
 {
     VFIOUserReply *r1, *r2;
diff --git a/MAINTAINERS b/MAINTAINERS
index f429bab391..52d37dd088 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1888,6 +1888,7 @@ S: Supported
 F: docs/devel/vfio-user.rst
 F: hw/vfio/user.c
 F: hw/vfio/user.h
+F: hw/vfio/user-protocol.h
 
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (4 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-24 15:59   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 07/16] vfio-user: get device info Elena Ufimtseva
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 hw/vfio/user-protocol.h |  23 ++++
 hw/vfio/user.h          |   1 +
 hw/vfio/pci.c           |   9 ++
 hw/vfio/user.c          | 267 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 300 insertions(+)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 27062cb910..14b762d1ad 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -52,6 +52,29 @@ enum vfio_user_command {
 #define VFIO_USER_ERROR         0x20
 
 
+/*
+ * VFIO_USER_VERSION
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint16_t major;
+    uint16_t minor;
+    char capabilities[];
+} VFIOUserVersion;
+
+#define VFIO_USER_MAJOR_VER     0
+#define VFIO_USER_MINOR_VER     0
+
+#define VFIO_USER_CAP           "capabilities"
+
+/* "capabilities" members */
+#define VFIO_USER_CAP_MAX_FDS   "max_msg_fds"
+#define VFIO_USER_CAP_MAX_XFER  "max_data_xfer_size"
+#define VFIO_USER_CAP_MIGR      "migration"
+
+/* "migration" member */
+#define VFIO_USER_CAP_PGSIZE    "pgsize"
+
 #define VFIO_USER_DEF_MAX_FDS   8
 #define VFIO_USER_MAX_MAX_FDS   16
 
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 905e374e12..cab957ba7a 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -70,5 +70,6 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                                              VFIOUserFDs *fds),
                                              void *reqarg);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
+int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
 
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7005d9f891..eae33e746f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3397,6 +3397,12 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         proxy->flags |= VFIO_PROXY_SECURE;
     }
 
+    vfio_user_validate_version(vbasedev, &err);
+    if (err != NULL) {
+        error_propagate(errp, err);
+        goto error;
+    }
+
     vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
     vbasedev->dev = DEVICE(vdev);
     vbasedev->fd = -1;
@@ -3404,6 +3410,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     vbasedev->no_mmap = false;
     vbasedev->ops = &vfio_user_pci_ops;
 
+error:
+    vfio_user_disconnect(proxy);
+    error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
 }
 
 static void vfio_user_instance_finalize(Object *obj)
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 2fcc77d997..e89464a571 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -23,9 +23,16 @@
 #include "io/channel-socket.h"
 #include "io/channel-util.h"
 #include "sysemu/iothread.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qnull.h"
+#include "qapi/qmp/qstring.h"
+#include "qapi/qmp/qnum.h"
 #include "user.h"
 
 static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
+static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
+static int wait_time = 1000;   /* wait 1 sec for replies */
 static IOThread *vfio_user_iothread;
 
 static void vfio_user_shutdown(VFIOProxy *proxy);
@@ -34,7 +41,14 @@ static void vfio_user_send_locked(VFIOProxy *proxy, VFIOUserHdr *msg,
                                   VFIOUserFDs *fds);
 static void vfio_user_send(VFIOProxy *proxy, VFIOUserHdr *msg,
                            VFIOUserFDs *fds);
+static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
+                                  uint32_t size, uint32_t flags);
+static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
+                                VFIOUserFDs *fds, int rsize, int flags);
 
+/* vfio_user_send_recv flags */
+#define NOWAIT          0x1  /* do not wait for reply */
+#define NOIOLOCK        0x2  /* do not drop iolock */
 
 /*
  * Functions called by main, CPU, or iothread threads
@@ -333,6 +347,79 @@ static void vfio_user_cb(void *opaque)
  * Functions called by main or CPU threads
  */
 
+static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
+                                VFIOUserFDs *fds, int rsize, int flags)
+{
+    VFIOUserReply *reply;
+    bool iolock = 0;
+
+    if (msg->flags & VFIO_USER_NO_REPLY) {
+        error_printf("vfio_user_send_recv on async message\n");
+        return;
+    }
+
+    /*
+     * We may block later, so use a per-proxy lock and let
+     * the iothreads run while we sleep unless told no to
+     */
+    QEMU_LOCK_GUARD(&proxy->lock);
+    if (!(flags & NOIOLOCK)) {
+        iolock = qemu_mutex_iothread_locked();
+        if (iolock) {
+            qemu_mutex_unlock_iothread();
+        }
+    }
+
+    reply = QTAILQ_FIRST(&proxy->free);
+    if (reply != NULL) {
+        QTAILQ_REMOVE(&proxy->free, reply, next);
+        reply->complete = 0;
+    } else {
+        reply = g_malloc0(sizeof(*reply));
+        qemu_cond_init(&reply->cv);
+    }
+    reply->msg = msg;
+    reply->fds = fds;
+    reply->id = msg->id;
+    reply->rsize = rsize ? rsize : msg->size;
+    QTAILQ_INSERT_TAIL(&proxy->pending, reply, next);
+
+    vfio_user_send_locked(proxy, msg, fds);
+    if (!(msg->flags & VFIO_USER_ERROR)) {
+        if (!(flags & NOWAIT)) {
+            while (reply->complete == 0) {
+                if (!qemu_cond_timedwait(&reply->cv, &proxy->lock, wait_time)) {
+                    msg->flags |= VFIO_USER_ERROR;
+                    msg->error_reply = ETIMEDOUT;
+                    break;
+                }
+            }
+            QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
+        } else {
+            reply->nowait = 1;
+            proxy->last_nowait = reply;
+        }
+    } else {
+        QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
+    }
+
+    if (iolock) {
+        qemu_mutex_lock_iothread();
+    }
+}
+
+static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
+                                  uint32_t size, uint32_t flags)
+{
+    static uint16_t next_id;
+
+    hdr->id = qatomic_fetch_inc(&next_id);
+    hdr->command = cmd;
+    hdr->size = size;
+    hdr->flags = (flags & ~VFIO_USER_TYPE) | VFIO_USER_REQUEST;
+    hdr->error_reply = 0;
+}
+
 static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
     QLIST_HEAD_INITIALIZER(vfio_user_sockets);
 
@@ -447,3 +534,183 @@ void vfio_user_disconnect(VFIOProxy *proxy)
 
     g_free(proxy);
 }
+
+struct cap_entry {
+    const char *name;
+    int (*check)(QObject *qobj, Error **errp);
+};
+
+static int caps_parse(QDict *qdict, struct cap_entry caps[], Error **errp)
+{
+    QObject *qobj;
+    struct cap_entry *p;
+
+    for (p = caps; p->name != NULL; p++) {
+        qobj = qdict_get(qdict, p->name);
+        if (qobj != NULL) {
+            if (p->check(qobj, errp)) {
+                return -1;
+            }
+            qdict_del(qdict, p->name);
+        }
+    }
+
+    /* warning, for now */
+    if (qdict_size(qdict) != 0) {
+        error_printf("spurious capabilities\n");
+    }
+    return 0;
+}
+
+static int check_pgsize(QObject *qobj, Error **errp)
+{
+    QNum *qn = qobject_to(QNum, qobj);
+    uint64_t pgsize;
+
+    if (qn == NULL || !qnum_get_try_uint(qn, &pgsize)) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_PGSIZE);
+        return -1;
+    }
+    return pgsize == 4096 ? 0 : -1;
+}
+
+static struct cap_entry caps_migr[] = {
+    { VFIO_USER_CAP_PGSIZE, check_pgsize },
+    { NULL }
+};
+
+static int check_max_fds(QObject *qobj, Error **errp)
+{
+    QNum *qn = qobject_to(QNum, qobj);
+
+    if (qn == NULL || !qnum_get_try_uint(qn, &max_send_fds) ||
+        max_send_fds > VFIO_USER_MAX_MAX_FDS) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_FDS);
+        return -1;
+    }
+    return 0;
+}
+
+static int check_max_xfer(QObject *qobj, Error **errp)
+{
+    QNum *qn = qobject_to(QNum, qobj);
+
+    if (qn == NULL || !qnum_get_try_uint(qn, &max_xfer_size) ||
+        max_xfer_size > VFIO_USER_MAX_MAX_XFER) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_XFER);
+        return -1;
+    }
+    return 0;
+}
+
+static int check_migr(QObject *qobj, Error **errp)
+{
+    QDict *qdict = qobject_to(QDict, qobj);
+
+    if (qdict == NULL || caps_parse(qdict, caps_migr, errp)) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_FDS);
+        return -1;
+    }
+    return 0;
+}
+
+static struct cap_entry caps_cap[] = {
+    { VFIO_USER_CAP_MAX_FDS, check_max_fds },
+    { VFIO_USER_CAP_MAX_XFER, check_max_xfer },
+    { VFIO_USER_CAP_MIGR, check_migr },
+    { NULL }
+};
+
+static int check_cap(QObject *qobj, Error **errp)
+{
+   QDict *qdict = qobject_to(QDict, qobj);
+
+    if (qdict == NULL || caps_parse(qdict, caps_cap, errp)) {
+        error_setg(errp, "malformed %s", VFIO_USER_CAP);
+        return -1;
+    }
+    return 0;
+}
+
+static struct cap_entry ver_0_0[] = {
+    { VFIO_USER_CAP, check_cap },
+    { NULL }
+};
+
+static int caps_check(int minor, const char *caps, Error **errp)
+{
+    QObject *qobj;
+    QDict *qdict;
+    int ret;
+
+    qobj = qobject_from_json(caps, NULL);
+    if (qobj == NULL) {
+        error_setg(errp, "malformed capabilities %s", caps);
+        return -1;
+    }
+    qdict = qobject_to(QDict, qobj);
+    if (qdict == NULL) {
+        error_setg(errp, "capabilities %s not an object", caps);
+        qobject_unref(qobj);
+        return -1;
+    }
+    ret = caps_parse(qdict, ver_0_0, errp);
+
+    qobject_unref(qobj);
+    return ret;
+}
+
+static GString *caps_json(void)
+{
+    QDict *dict = qdict_new();
+    QDict *capdict = qdict_new();
+    QDict *migdict = qdict_new();
+    GString *str;
+
+    qdict_put_int(migdict, VFIO_USER_CAP_PGSIZE, 4096);
+    qdict_put_obj(capdict, VFIO_USER_CAP_MIGR, QOBJECT(migdict));
+
+    qdict_put_int(capdict, VFIO_USER_CAP_MAX_FDS, VFIO_USER_MAX_MAX_FDS);
+    qdict_put_int(capdict, VFIO_USER_CAP_MAX_XFER, VFIO_USER_DEF_MAX_XFER);
+
+    qdict_put_obj(dict, VFIO_USER_CAP, QOBJECT(capdict));
+
+    str = qobject_to_json(QOBJECT(dict));
+    qobject_unref(dict);
+    return str;
+}
+
+int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
+{
+    g_autofree VFIOUserVersion *msgp;
+    GString *caps;
+    int size, caplen;
+
+    caps = caps_json();
+    caplen = caps->len + 1;
+    size = sizeof(*msgp) + caplen;
+    msgp = g_malloc0(size);
+
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_VERSION, size, 0);
+    msgp->major = VFIO_USER_MAJOR_VER;
+    msgp->minor = VFIO_USER_MINOR_VER;
+    memcpy(&msgp->capabilities, caps->str, caplen);
+    g_string_free(caps, true);
+
+    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0, 0);
+    if (msgp->hdr.flags & VFIO_USER_ERROR) {
+        error_setg_errno(errp, msgp->hdr.error_reply, "version reply");
+        return -1;
+    }
+
+    if (msgp->major != VFIO_USER_MAJOR_VER ||
+        msgp->minor > VFIO_USER_MINOR_VER) {
+        error_setg(errp, "incompatible server version");
+        return -1;
+    }
+    if (caps_check(msgp->minor, (char *)msgp + sizeof(*msgp), errp) != 0) {
+        return -1;
+    }
+
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 07/16] vfio-user: get device info
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (5 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-08-24 16:04   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 08/16] vfio-user: get region info Elena Ufimtseva
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h | 13 +++++++++++++
 hw/vfio/user.h          |  1 +
 hw/vfio/pci.c           | 13 +++++++++++++
 hw/vfio/user.c          | 20 ++++++++++++++++++++
 4 files changed, 47 insertions(+)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 14b762d1ad..13e44ebf1c 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -82,4 +82,17 @@ typedef struct {
 #define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)
 
 
+/*
+ * VFIO_USER_DEVICE_GET_INFO
+ * imported from struct_device_info
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t num_regions;
+    uint32_t num_irqs;
+    uint32_t cap_offset;
+} VFIOUserDeviceInfo;
+
 #endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index cab957ba7a..82044e7e78 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -71,5 +71,6 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                                              void *reqarg);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
+int vfio_user_get_info(VFIODevice *vbasedev);
 
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index eae33e746f..63aa2441f0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3369,6 +3369,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     VFIODevice *vbasedev = &vdev->vbasedev;
     SocketAddress addr;
     VFIOProxy *proxy;
+    int ret;
     Error *err = NULL;
 
     /*
@@ -3410,6 +3411,18 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     vbasedev->no_mmap = false;
     vbasedev->ops = &vfio_user_pci_ops;
 
+    ret = vfio_user_get_info(&vdev->vbasedev);
+    if (ret) {
+        error_setg_errno(errp, -ret, "get info failure");
+        goto error;
+    }
+
+    vfio_populate_device(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        goto error;
+    }
+
 error:
     vfio_user_disconnect(proxy);
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index e89464a571..b584b8e0f2 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -714,3 +714,23 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
 
     return 0;
 }
+
+int vfio_user_get_info(VFIODevice *vbasedev)
+{
+    VFIOUserDeviceInfo msg;
+
+    memset(&msg, 0, sizeof(msg));
+    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_INFO, sizeof(msg), 0);
+    msg.argsz = sizeof(struct vfio_device_info);
+
+    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
+    if (msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msg.hdr.error_reply;
+    }
+
+    vbasedev->num_irqs = msg.num_irqs;
+    vbasedev->num_regions = msg.num_regions;
+    vbasedev->flags = msg.flags;
+    vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 08/16] vfio-user: get region info
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (6 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 07/16] vfio-user: get device info Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-07 14:31   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 09/16] vfio-user: region read/write Elena Ufimtseva
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h       | 14 +++++++
 hw/vfio/user.h                |  2 +
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 76 ++++++++++++++++++++++++++++++++++-
 hw/vfio/user.c                | 33 +++++++++++++++
 5 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 13e44ebf1c..104bf4ff31 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -95,4 +95,18 @@ typedef struct {
     uint32_t cap_offset;
 } VFIOUserDeviceInfo;
 
+/*
+ * VFIO_USER_DEVICE_GET_REGION_INFO
+ * imported from struct_vfio_region_info
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t index;
+    uint32_t cap_offset;
+    uint64_t size;
+    uint64_t offset;
+} VFIOUserRegionInfo;
+
 #endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 82044e7e78..f0122539ba 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -72,5 +72,7 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
 int vfio_user_get_info(VFIODevice *vbasedev);
+int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
+                              struct vfio_region_info *info, VFIOUserFDs *fds);
 
 #endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f43dc6e5d0..bdd25a546c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -56,6 +56,7 @@ typedef struct VFIORegion {
     uint32_t nr_mmaps;
     VFIOMmap *mmaps;
     uint8_t nr; /* cache the region number for debug */
+    int remfd; /* fd if exported from remote process */
 } VFIORegion;
 
 typedef struct VFIOMigration {
@@ -145,6 +146,8 @@ typedef struct VFIODevice {
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
     VFIOProxy *proxy;
+    struct vfio_region_info **regions;
+    int *regfds;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..7d667b0533 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "hw/vfio/user.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -1514,6 +1515,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
     return true;
 }
 
+static int vfio_get_region_info_remfd(VFIODevice *vbasedev, int index)
+{
+    struct vfio_region_info *info;
+
+    if (vbasedev->regions == NULL || vbasedev->regions[index] == NULL) {
+        vfio_get_region_info(vbasedev, index, &info);
+    }
+    return vbasedev->regfds != NULL ? vbasedev->regfds[index] : -1;
+}
+
 static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
                                           struct vfio_region_info *info)
 {
@@ -1567,6 +1578,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
     region->size = info->size;
     region->fd_offset = info->offset;
     region->nr = index;
+    region->remfd = vfio_get_region_info_remfd(vbasedev, index);
 
     if (region->size) {
         region->mem = g_new0(MemoryRegion, 1);
@@ -1610,6 +1622,7 @@ int vfio_region_mmap(VFIORegion *region)
 {
     int i, prot = 0;
     char *name;
+    int fd;
 
     if (!region->mem) {
         return 0;
@@ -1618,9 +1631,11 @@ int vfio_region_mmap(VFIORegion *region)
     prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
     prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
+    fd = region->remfd != -1 ? region->remfd : region->vbasedev->fd;
+
     for (i = 0; i < region->nr_mmaps; i++) {
         region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
-                                     MAP_SHARED, region->vbasedev->fd,
+                                     MAP_SHARED, fd,
                                      region->fd_offset +
                                      region->mmaps[i].offset);
         if (region->mmaps[i].mmap == MAP_FAILED) {
@@ -2397,6 +2412,23 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 
 void vfio_put_base_device(VFIODevice *vbasedev)
 {
+    if (vbasedev->regions != NULL) {
+        int i;
+
+        for (i = 0; i < vbasedev->num_regions; i++) {
+            if (vbasedev->regfds != NULL && vbasedev->regfds[i] != -1) {
+                close(vbasedev->regfds[i]);
+            }
+            g_free(vbasedev->regions[i]);
+        }
+        g_free(vbasedev->regions);
+        vbasedev->regions = NULL;
+        if (vbasedev->regfds != NULL) {
+            g_free(vbasedev->regfds);
+            vbasedev->regfds = NULL;
+        }
+    }
+
     if (!vbasedev->group) {
         return;
     }
@@ -2410,6 +2442,24 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info)
 {
     size_t argsz = sizeof(struct vfio_region_info);
+    int fd = -1;
+    int ret;
+
+    /* create region cache */
+    if (vbasedev->regions == NULL) {
+        vbasedev->regions = g_new0(struct vfio_region_info *,
+                                   vbasedev->num_regions);
+        if (vbasedev->proxy != NULL) {
+            vbasedev->regfds = g_new0(int, vbasedev->num_regions);
+        }
+    }
+    /* check cache */
+    if (vbasedev->regions[index] != NULL) {
+        *info = g_malloc0(vbasedev->regions[index]->argsz);
+        memcpy(*info, vbasedev->regions[index],
+               vbasedev->regions[index]->argsz);
+        return 0;
+    }
 
     *info = g_malloc0(argsz);
 
@@ -2417,7 +2467,17 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 retry:
     (*info)->argsz = argsz;
 
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
+    if (vbasedev->proxy != NULL) {
+        VFIOUserFDs fds = { 0, 1, &fd};
+
+        ret = vfio_user_get_region_info(vbasedev, index, *info, &fds);
+    } else {
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info);
+        if (ret < 0) {
+            ret = -errno;
+        }
+    }
+    if (ret != 0) {
         g_free(*info);
         *info = NULL;
         return -errno;
@@ -2426,10 +2486,22 @@ retry:
     if ((*info)->argsz > argsz) {
         argsz = (*info)->argsz;
         *info = g_realloc(*info, argsz);
+        if (fd != -1) {
+            close(fd);
+            fd = -1;
+        }
 
         goto retry;
     }
 
+    /* fill cache */
+    vbasedev->regions[index] = g_malloc0(argsz);
+    memcpy(vbasedev->regions[index], *info, argsz);
+    *vbasedev->regions[index] = **info;
+    if (vbasedev->regfds != NULL) {
+        vbasedev->regfds[index] = fd;
+    }
+
     return 0;
 }
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index b584b8e0f2..91b51f37df 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -734,3 +734,36 @@ int vfio_user_get_info(VFIODevice *vbasedev)
     vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
     return 0;
 }
+
+int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
+                              struct vfio_region_info *info, VFIOUserFDs *fds)
+{
+    g_autofree VFIOUserRegionInfo *msgp = NULL;
+    int size;
+
+    /* data returned can be larger than vfio_region_info */
+    if (info->argsz < sizeof(*info)) {
+        error_printf("vfio_user_get_region_info argsz too small\n");
+        return -EINVAL;
+    }
+    if (fds != NULL && fds->send_fds != 0) {
+        error_printf("vfio_user_get_region_info can't send FDs\n");
+        return -EINVAL;
+    }
+
+    size = info->argsz + sizeof(VFIOUserHdr);
+    msgp = g_malloc0(size);
+
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_GET_REGION_INFO,
+                          sizeof(*msgp), 0);
+    msgp->argsz = info->argsz;
+    msgp->index = info->index;
+
+    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, fds, size, 0);
+    if (msgp->hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->hdr.error_reply;
+    }
+
+    memcpy(info, &msgp->argsz, info->argsz);
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (7 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 08/16] vfio-user: get region info Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-07 14:41   ` Stefan Hajnoczi
  2021-09-07 17:24   ` John Levon
  2021-08-16 16:42 ` [PATCH RFC v2 10/16] vfio-user: pci_user_realize PCI setup Elena Ufimtseva
                   ` (7 subsequent siblings)
  16 siblings, 2 replies; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h | 12 ++++++++++++
 hw/vfio/user.h          |  4 ++++
 hw/vfio/common.c        | 16 +++++++++++++--
 hw/vfio/user.c          | 43 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 104bf4ff31..56904cf872 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -109,4 +109,16 @@ typedef struct {
     uint64_t offset;
 } VFIOUserRegionInfo;
 
+/*
+ * VFIO_USER_REGION_READ
+ * VFIO_USER_REGION_WRITE
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint64_t offset;
+    uint32_t region;
+    uint32_t count;
+    char data[];
+} VFIOUserRegionRW;
+
 #endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index f0122539ba..02f832a173 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -74,5 +74,9 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
 int vfio_user_get_info(VFIODevice *vbasedev);
 int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
                               struct vfio_region_info *info, VFIOUserFDs *fds);
+int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
+                          uint32_t count, void *data);
+int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
+                           uint64_t offset, uint32_t count, void *data);
 
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7d667b0533..a8b1ea9358 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -215,6 +215,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
         uint32_t dword;
         uint64_t qword;
     } buf;
+    int ret;
 
     switch (size) {
     case 1:
@@ -234,7 +235,12 @@ void vfio_region_write(void *opaque, hwaddr addr,
         break;
     }
 
-    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_region_write(vbasedev, region->nr, addr, size, &data);
+    } else {
+        ret = pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr);
+    }
+    if (ret != size) {
         error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
                      ",%d) failed: %m",
                      __func__, vbasedev->name, region->nr,
@@ -266,8 +272,14 @@ uint64_t vfio_region_read(void *opaque,
         uint64_t qword;
     } buf;
     uint64_t data = 0;
+    int ret;
 
-    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_region_read(vbasedev, region->nr, addr, size, &buf);
+    } else {
+        ret = pread(vbasedev->fd, &buf, size, region->fd_offset + addr);
+    }
+    if (ret != size) {
         error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
                      __func__, vbasedev->name, region->nr,
                      addr, size);
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 91b51f37df..83235b2411 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -767,3 +767,46 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
     memcpy(info, &msgp->argsz, info->argsz);
     return 0;
 }
+
+int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
+                                 uint32_t count, void *data)
+{
+    g_autofree VFIOUserRegionRW *msgp = NULL;
+    int size = sizeof(*msgp) + count;
+
+    msgp = g_malloc0(size);
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_READ, sizeof(*msgp), 0);
+    msgp->offset = offset;
+    msgp->region = index;
+    msgp->count = count;
+
+    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, size, 0);
+    if (msgp->hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->hdr.error_reply;
+    } else if (msgp->count > count) {
+        return -E2BIG;
+    } else {
+        memcpy(data, &msgp->data, msgp->count);
+    }
+
+    return msgp->count;
+}
+
+int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
+                           uint64_t offset, uint32_t count, void *data)
+{
+    g_autofree VFIOUserRegionRW *msgp = NULL;
+    int size = sizeof(*msgp) + count;
+
+    msgp = g_malloc0(size);
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
+                          VFIO_USER_NO_REPLY);
+    msgp->offset = offset;
+    msgp->region = index;
+    msgp->count = count;
+    memcpy(&msgp->data, data, count);
+
+    vfio_user_send(vbasedev->proxy, &msgp->hdr, NULL);
+
+    return count;
+}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 10/16] vfio-user: pci_user_realize PCI setup
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (8 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 09/16] vfio-user: region read/write Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-07 15:00   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 11/16] vfio-user: get and set IRQs Elena Ufimtseva
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

PCI BARs read from remote device
PCI config reads/writes sent to remote server

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/pci.c | 210 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 175 insertions(+), 35 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 63aa2441f0..ea0df8be65 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -807,8 +807,14 @@ static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
     memset(vdev->rom, 0xff, size);
 
     while (size) {
-        bytes = pread(vdev->vbasedev.fd, vdev->rom + off,
-                      size, vdev->rom_offset + off);
+        if (vdev->vbasedev.proxy != NULL) {
+            bytes = vfio_user_region_read(&vdev->vbasedev,
+                                          VFIO_PCI_ROM_REGION_INDEX,
+                                          off, size, vdev->rom + off);
+        } else {
+            bytes = pread(vdev->vbasedev.fd, vdev->rom + off,
+                          size, vdev->rom_offset + off);
+        }
         if (bytes == 0) {
             break;
         } else if (bytes > 0) {
@@ -927,12 +933,28 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
      * Use the same size ROM BAR as the physical device.  The contents
      * will get filled in later when the guest tries to read it.
      */
-    if (pread(fd, &orig, 4, offset) != 4 ||
-        pwrite(fd, &size, 4, offset) != 4 ||
-        pread(fd, &size, 4, offset) != 4 ||
-        pwrite(fd, &orig, 4, offset) != 4) {
-        error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
-        return;
+    if (vdev->vbasedev.proxy != NULL) {
+        if (vfio_user_region_read(&vdev->vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
+                                  PCI_ROM_ADDRESS, 4, &orig) != 4 ||
+            vfio_user_region_write(&vdev->vbasedev,
+                                   VFIO_PCI_CONFIG_REGION_INDEX,
+                                   PCI_ROM_ADDRESS, 4, &size) != 4 ||
+            vfio_user_region_read(&vdev->vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
+                                  PCI_ROM_ADDRESS, 4, &size) != 4 ||
+            vfio_user_region_write(&vdev->vbasedev,
+                                   VFIO_PCI_CONFIG_REGION_INDEX,
+                                   PCI_ROM_ADDRESS, 4, &orig) != 4) {
+            error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
+            return;
+        }
+    } else {
+        if (pread(fd, &orig, 4, offset) != 4 ||
+            pwrite(fd, &size, 4, offset) != 4 ||
+            pread(fd, &size, 4, offset) != 4 ||
+            pwrite(fd, &orig, 4, offset) != 4) {
+            error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
+            return;
+        }
     }
 
     size = ~(le32_to_cpu(size) & PCI_ROM_ADDRESS_MASK) + 1;
@@ -1123,8 +1145,14 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
     if (~emu_bits & (0xffffffffU >> (32 - len * 8))) {
         ssize_t ret;
 
-        ret = pread(vdev->vbasedev.fd, &phys_val, len,
-                    vdev->config_offset + addr);
+        if (vdev->vbasedev.proxy != NULL) {
+            ret = vfio_user_region_read(&vdev->vbasedev,
+                                        VFIO_PCI_CONFIG_REGION_INDEX,
+                                        addr, len, &phys_val);
+        } else {
+            ret = pread(vdev->vbasedev.fd, &phys_val, len,
+                        vdev->config_offset + addr);
+        }
         if (ret != len) {
             error_report("%s(%s, 0x%x, 0x%x) failed: %m",
                          __func__, vdev->vbasedev.name, addr, len);
@@ -1145,12 +1173,20 @@ void vfio_pci_write_config(PCIDevice *pdev,
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     uint32_t val_le = cpu_to_le32(val);
+    int ret;
 
     trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
 
     /* Write everything to VFIO, let it filter out what we can't write */
-    if (pwrite(vdev->vbasedev.fd, &val_le, len, vdev->config_offset + addr)
-                != len) {
+    if (vdev->vbasedev.proxy != NULL) {
+        ret = vfio_user_region_write(&vdev->vbasedev,
+                                     VFIO_PCI_CONFIG_REGION_INDEX,
+                                     addr, len, &val_le);
+    } else {
+        ret = pwrite(vdev->vbasedev.fd, &val_le, len,
+                     vdev->config_offset + addr);
+    }
+    if (ret != len) {
         error_report("%s(%s, 0x%x, 0x%x, 0x%x) failed: %m",
                      __func__, vdev->vbasedev.name, addr, val, len);
     }
@@ -1240,10 +1276,15 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
     int ret, entries;
     Error *err = NULL;
 
-    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
-              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
-        error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
-        return -errno;
+    if (vdev->vbasedev.proxy != NULL) {
+        /* during setup, config space was initialized from remote */
+        memcpy(&ctrl, vdev->pdev.config + pos + PCI_CAP_FLAGS, sizeof(ctrl));
+    } else {
+        if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+                  vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+            error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
+            return -errno;
+        }
     }
     ctrl = le16_to_cpu(ctrl);
 
@@ -1456,22 +1497,30 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
-    if (pread(fd, &ctrl, sizeof(ctrl),
-              vdev->config_offset + pos + PCI_MSIX_FLAGS) != sizeof(ctrl)) {
-        error_setg_errno(errp, errno, "failed to read PCI MSIX FLAGS");
-        return;
-    }
+    if (vdev->vbasedev.proxy != NULL) {
+        /* during setup, config space was initialized from remote */
+        memcpy(&ctrl, vdev->pdev.config + pos + PCI_MSIX_FLAGS, sizeof(ctrl));
+        memcpy(&table, vdev->pdev.config + pos + PCI_MSIX_TABLE, sizeof(table));
+        memcpy(&pba, vdev->pdev.config + pos + PCI_MSIX_PBA, sizeof(pba));
+    } else {
+        if (pread(fd, &ctrl, sizeof(ctrl),
+                  vdev->config_offset + pos + PCI_MSIX_FLAGS) != sizeof(ctrl)) {
+            error_setg_errno(errp, errno, "failed to read PCI MSIX FLAGS");
+            return;
+        }
 
-    if (pread(fd, &table, sizeof(table),
-              vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) {
-        error_setg_errno(errp, errno, "failed to read PCI MSIX TABLE");
-        return;
-    }
+        if (pread(fd, &table, sizeof(table),
+                  vdev->config_offset + pos +
+                  PCI_MSIX_TABLE) != sizeof(table)) {
+            error_setg_errno(errp, errno, "failed to read PCI MSIX TABLE");
+            return;
+        }
 
-    if (pread(fd, &pba, sizeof(pba),
-              vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
-        error_setg_errno(errp, errno, "failed to read PCI MSIX PBA");
-        return;
+        if (pread(fd, &pba, sizeof(pba),
+                  vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
+            error_setg_errno(errp, errno, "failed to read PCI MSIX PBA");
+            return;
+        }
     }
 
     ctrl = le16_to_cpu(ctrl);
@@ -1619,11 +1668,17 @@ static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
     }
 
     /* Determine what type of BAR this is for registration */
-    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
-                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
-    if (ret != sizeof(pci_bar)) {
-        error_report("vfio: Failed to read BAR %d (%m)", nr);
-        return;
+    if (vdev->vbasedev.proxy != NULL) {
+        /* during setup, config space was initialized from remote */
+        memcpy(&pci_bar, vdev->pdev.config + PCI_BASE_ADDRESS_0 + (4 * nr),
+               sizeof(pci_bar));
+    } else {
+        ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                    vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+        if (ret != sizeof(pci_bar)) {
+            error_report("vfio: Failed to read BAR %d (%m)", nr);
+            return;
+        }
     }
 
     pci_bar = le32_to_cpu(pci_bar);
@@ -3423,6 +3478,91 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    /* Get a copy of config space */
+    ret = vfio_user_region_read(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX, 0,
+                                MIN(pci_config_size(pdev), vdev->config_size),
+                                pdev->config);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        error_setg_errno(errp, -ret, "failed to read device config space");
+        goto error;
+    }
+
+    /* vfio emulates a lot for us, but some bits need extra love */
+    vdev->emulated_config_bits = g_malloc0(vdev->config_size);
+
+    /* QEMU can choose to expose the ROM or not */
+    memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
+    /* QEMU can also add or extend BARs */
+    memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
+    vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
+    vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
+
+    /* QEMU can change multi-function devices to single function, or reverse */
+    vdev->emulated_config_bits[PCI_HEADER_TYPE] =
+                                              PCI_HEADER_TYPE_MULTI_FUNCTION;
+
+    /* Restore or clear multifunction, this is always controlled by QEMU */
+    if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        vdev->pdev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
+    } else {
+        vdev->pdev.config[PCI_HEADER_TYPE] &= ~PCI_HEADER_TYPE_MULTI_FUNCTION;
+    }
+
+    /*
+     * Clear host resource mapping info.  If we choose not to register a
+     * BAR, such as might be the case with the option ROM, we can get
+     * confusing, unwritable, residual addresses from the host here.
+     */
+    memset(&vdev->pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
+    memset(&vdev->pdev.config[PCI_ROM_ADDRESS], 0, 4);
+
+    vfio_pci_size_rom(vdev);
+
+    vfio_bars_prepare(vdev);
+
+    vfio_msix_early_setup(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        goto error;
+    }
+
+    vfio_bars_register(vdev);
+
+    ret = vfio_add_capabilities(vdev, errp);
+    if (ret) {
+        goto out_teardown;
+    }
+
+    /* QEMU emulates all of MSI & MSIX */
+    if (pdev->cap_present & QEMU_PCI_CAP_MSIX) {
+        memset(vdev->emulated_config_bits + pdev->msix_cap, 0xff,
+               MSIX_CAP_LENGTH);
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI) {
+        memset(vdev->emulated_config_bits + pdev->msi_cap, 0xff,
+               vdev->msi_cap_size);
+    }
+
+    if (vdev->pdev.config[PCI_INTERRUPT_PIN] != 0) {
+        vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                             vfio_intx_mmap_enable, vdev);
+        pci_device_set_intx_routing_notifier(&vdev->pdev,
+                                             vfio_intx_routing_notifier);
+        vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
+        kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
+        ret = vfio_intx_enable(vdev, errp);
+        if (ret) {
+            goto out_deregister;
+        }
+    }
+
+out_deregister:
+    pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
+    kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
+out_teardown:
+    vfio_teardown_msi(vdev);
+    vfio_bars_exit(vdev);
 error:
     vfio_user_disconnect(proxy);
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 11/16] vfio-user: get and set IRQs
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (9 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 10/16] vfio-user: pci_user_realize PCI setup Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-07 15:14   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 12/16] vfio-user: proxy container connect/disconnect Elena Ufimtseva
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h |  25 ++++++++++
 hw/vfio/user.h          |   2 +
 hw/vfio/common.c        |  26 ++++++++--
 hw/vfio/pci.c           |  31 ++++++++++--
 hw/vfio/user.c          | 106 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 181 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 56904cf872..5614efa0a4 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -109,6 +109,31 @@ typedef struct {
     uint64_t offset;
 } VFIOUserRegionInfo;
 
+/*
+ * VFIO_USER_DEVICE_GET_IRQ_INFO
+ * imported from struct vfio_irq_info
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t index;
+    uint32_t count;
+} VFIOUserIRQInfo;
+
+/*
+ * VFIO_USER_DEVICE_SET_IRQS
+ * imported from struct vfio_irq_set
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint32_t index;
+    uint32_t start;
+    uint32_t count;
+} VFIOUserIRQSet;
+
 /*
  * VFIO_USER_REGION_READ
  * VFIO_USER_REGION_WRITE
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 02f832a173..248ad80943 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -74,6 +74,8 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
 int vfio_user_get_info(VFIODevice *vbasedev);
 int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
                               struct vfio_region_info *info, VFIOUserFDs *fds);
+int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info);
+int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
 int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
                           uint32_t count, void *data);
 int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a8b1ea9358..9fe3e05dc6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -71,7 +71,11 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
         .count = 0,
     };
 
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    if (vbasedev->proxy != NULL) {
+        vfio_user_set_irqs(vbasedev, &irq_set);
+    } else {
+        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    }
 }
 
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
@@ -84,7 +88,11 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
         .count = 1,
     };
 
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    if (vbasedev->proxy != NULL) {
+        vfio_user_set_irqs(vbasedev, &irq_set);
+    } else {
+        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    }
 }
 
 void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
@@ -97,7 +105,11 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
         .count = 1,
     };
 
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    if (vbasedev->proxy != NULL) {
+        vfio_user_set_irqs(vbasedev, &irq_set);
+    } else {
+        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+    }
 }
 
 static inline const char *action_to_str(int action)
@@ -178,8 +190,12 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
     pfd = (int32_t *)&irq_set->data;
     *pfd = fd;
 
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
-        ret = -errno;
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_set_irqs(vbasedev, irq_set);
+    } else {
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
+            ret = -errno;
+        }
     }
     g_free(irq_set);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ea0df8be65..282de6a30b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -403,7 +403,11 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
         fds[i] = fd;
     }
 
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (vdev->vbasedev.proxy != NULL) {
+        ret = vfio_user_set_irqs(&vdev->vbasedev, irq_set);
+    } else {
+        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    }
 
     g_free(irq_set);
 
@@ -2675,7 +2679,13 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 
     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
 
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    if (vbasedev->proxy != NULL) {
+        ret = vfio_user_get_irq_info(vbasedev, &irq_info);
+    } else {
+        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    }
+
+
     if (ret) {
         /* This can fail for an old kernel or legacy PCI dev */
         trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
@@ -2794,8 +2804,16 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (ioctl(vdev->vbasedev.fd,
-              VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 || irq_info.count < 1) {
+    if (vdev->vbasedev.proxy != NULL) {
+        if (vfio_user_get_irq_info(&vdev->vbasedev, &irq_info) < 0) {
+            return;
+        }
+    } else {
+        if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0) {
+            return;
+        }
+    }
+    if (irq_info.count < 1) {
         return;
     }
 
@@ -3557,6 +3575,11 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    vfio_register_err_notifier(vdev);
+    vfio_register_req_notifier(vdev);
+
+    return;
+
 out_deregister:
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 83235b2411..b68ca1279d 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -768,6 +768,112 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
     return 0;
 }
 
+int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info)
+{
+    VFIOUserIRQInfo msg;
+
+    memset(&msg, 0, sizeof(msg));
+    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_IRQ_INFO,
+                          sizeof(msg), 0);
+    msg.argsz = info->argsz;
+    msg.index = info->index;
+
+    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
+    if (msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msg.hdr.error_reply;
+    }
+
+    memcpy(info, &msg.argsz, sizeof(*info));
+    return 0;
+}
+
+static int irq_howmany(int *fdp, int cur, int max)
+{
+    int n = 0;
+
+    if (fdp[cur] != -1) {
+        do {
+            n++;
+        } while (n < max && fdp[cur + n] != -1 && n < max_send_fds);
+    } else {
+        do {
+            n++;
+        } while (n < max && fdp[cur + n] == -1 && n < max_send_fds);
+    }
+
+    return n;
+}
+
+int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq)
+{
+    g_autofree VFIOUserIRQSet *msgp = NULL;
+    uint32_t size, nfds, send_fds, sent_fds;
+
+    if (irq->argsz < sizeof(*irq)) {
+        error_printf("vfio_user_set_irqs argsz too small\n");
+        return -EINVAL;
+    }
+
+    /*
+     * Handle simple case
+     */
+    if ((irq->flags & VFIO_IRQ_SET_DATA_EVENTFD) == 0) {
+        size = sizeof(VFIOUserHdr) + irq->argsz;
+        msgp = g_malloc0(size);
+
+        vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_SET_IRQS, size, 0);
+        msgp->argsz = irq->argsz;
+        msgp->flags = irq->flags;
+        msgp->index = irq->index;
+        msgp->start = irq->start;
+        msgp->count = irq->count;
+
+        vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0, 0);
+        if (msgp->hdr.flags & VFIO_USER_ERROR) {
+            return -msgp->hdr.error_reply;
+        }
+
+        return 0;
+    }
+
+    /*
+     * Calculate the number of FDs to send
+     * and adjust argsz
+     */
+    nfds = (irq->argsz - sizeof(*irq)) / sizeof(int);
+    irq->argsz = sizeof(*irq);
+    msgp = g_malloc0(sizeof(*msgp));
+    /*
+     * Send in chunks if over max_send_fds
+     */
+    for (sent_fds = 0; nfds > sent_fds; sent_fds += send_fds) {
+        VFIOUserFDs *arg_fds, loop_fds;
+
+        /* must send all valid FDs or all invalid FDs in single msg */
+        send_fds = irq_howmany((int *)irq->data, sent_fds, nfds - sent_fds);
+
+        vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_SET_IRQS,
+                              sizeof(*msgp), 0);
+        msgp->argsz = irq->argsz;
+        msgp->flags = irq->flags;
+        msgp->index = irq->index;
+        msgp->start = irq->start + sent_fds;
+        msgp->count = send_fds;
+
+        loop_fds.send_fds = send_fds;
+        loop_fds.recv_fds = 0;
+        loop_fds.fds = (int *)irq->data + sent_fds;
+        arg_fds = loop_fds.fds[0] != -1 ? &loop_fds : NULL;
+
+        vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, arg_fds, 0, 0);
+        if (msgp->hdr.flags & VFIO_USER_ERROR) {
+            return -msgp->hdr.error_reply;
+        }
+    }
+
+    return 0;
+}
+
 int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
                                  uint32_t count, void *data)
 {
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 12/16] vfio-user: proxy container connect/disconnect
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (10 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 11/16] vfio-user: get and set IRQs Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-08  8:30   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 13/16] vfio-user: dma map/unmap operations Elena Ufimtseva
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 84 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 22 +++++++++
 3 files changed, 109 insertions(+)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bdd25a546c..688660c28d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -91,6 +91,7 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    VFIOProxy *proxy;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
@@ -217,6 +218,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as);
+void vfio_disconnect_proxy(VFIOGroup *group);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9fe3e05dc6..57b9e111e6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -2249,6 +2249,55 @@ put_space_exit:
     return ret;
 }
 
+void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
+
+    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+
+    /*
+     * try to mirror vfio_connect_container()
+     * as much as possible
+     */
+
+    space = vfio_get_address_space(as);
+
+    container = g_malloc0(sizeof(*container));
+    container->space = space;
+    container->fd = -1;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    container->proxy = proxy;
+
+    /*
+     * The proxy uses a SW IOMMU in lieu of the HW one
+     * used in the ioctl() version.  Use TYPE1 with the
+     * target's page size for maximum capatibility
+     */
+    container->iommu_type = VFIO_TYPE1_IOMMU;
+    vfio_host_win_add(container, 0, (hwaddr)-1, TARGET_PAGE_SIZE);
+    container->pgsizes = TARGET_PAGE_SIZE;
+
+    container->dirty_pages_supported = true;
+    container->max_dirty_bitmap_size = VFIO_USER_DEF_MAX_XFER;
+    container->dirty_pgsizes = TARGET_PAGE_SIZE;
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    container->listener = vfio_memory_listener;
+    memory_listener_register(&container->listener, container->space->as);
+    container->initialized = true;
+}
+
 static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
@@ -2291,6 +2340,41 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
+void vfio_disconnect_proxy(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+    VFIOAddressSpace *space = container->space;
+    VFIOGuestIOMMU *giommu, *tmp;
+
+    /*
+     * try to mirror vfio_disconnect_container()
+     * as much as possible, knowing each device
+     * is in one group and one container
+     */
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    /*
+     * Explicitly release the listener first before unset container,
+     * since unset may destroy the backend container if it's the last
+     * group.
+     */
+    memory_listener_unregister(&container->listener);
+
+    QLIST_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+            MEMORY_REGION(giommu->iommu), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+
+    g_free(container);
+    vfio_put_address_space(space);
+}
+
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 282de6a30b..2c9fcb2fa9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3442,6 +3442,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     VFIODevice *vbasedev = &vdev->vbasedev;
     SocketAddress addr;
     VFIOProxy *proxy;
+    VFIOGroup *group = NULL;
     int ret;
     Error *err = NULL;
 
@@ -3484,6 +3485,19 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
     vbasedev->no_mmap = false;
     vbasedev->ops = &vfio_user_pci_ops;
 
+    /*
+     * each device gets its own group and container
+     * make them unrelated to any host IOMMU groupings
+     */
+    group = g_malloc0(sizeof(*group));
+    group->fd = -1;
+    group->groupid = -1;
+    QLIST_INIT(&group->device_list);
+    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
+    vbasedev->group = group;
+
+    vfio_connect_proxy(proxy, group, pci_device_iommu_address_space(pdev));
+
     ret = vfio_user_get_info(&vdev->vbasedev);
     if (ret) {
         error_setg_errno(errp, -ret, "get info failure");
@@ -3587,6 +3601,9 @@ out_teardown:
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
 error:
+    if (group != NULL) {
+        vfio_disconnect_proxy(group);
+    }
     vfio_user_disconnect(proxy);
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
 }
@@ -3595,6 +3612,11 @@ static void vfio_user_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
     VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIOGroup *group = vbasedev->group;
+
+    vfio_disconnect_proxy(group);
+    g_free(group);
+    vbasedev->group = NULL;
 
     vfio_put_device(vdev);
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 13/16] vfio-user: dma map/unmap operations
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (11 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 12/16] vfio-user: proxy container connect/disconnect Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-08  9:16   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 14/16] vfio-user: dma read/write operations Elena Ufimtseva
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 hw/vfio/user-protocol.h       |  32 +++++++++
 hw/vfio/user.h                |   6 ++
 include/hw/vfio/vfio-common.h |   1 +
 hw/vfio/common.c              |  71 ++++++++++++++++---
 hw/vfio/user.c                | 124 ++++++++++++++++++++++++++++++++++
 5 files changed, 226 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 5614efa0a4..ca53fce5f4 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -82,6 +82,31 @@ typedef struct {
 #define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)
 
 
+/*
+ * VFIO_USER_DMA_MAP
+ * imported from struct vfio_iommu_type1_dma_map
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint64_t offset;    /* FD offset */
+    uint64_t iova;
+    uint64_t size;
+} VFIOUserDMAMap;
+
+/*
+ * VFIO_USER_DMA_UNMAP
+ * imported from struct vfio_iommu_type1_dma_unmap
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+    uint64_t iova;
+    uint64_t size;
+} VFIOUserDMAUnmap;
+
 /*
  * VFIO_USER_DEVICE_GET_INFO
  * imported from struct_device_info
@@ -146,4 +171,11 @@ typedef struct {
     char data[];
 } VFIOUserRegionRW;
 
+/*imported from struct vfio_bitmap */
+typedef struct {
+    uint64_t pgsize;
+    uint64_t size;
+    char data[];
+} VFIOUserBitmap;
+
 #endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 248ad80943..7786ab57c5 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -71,6 +71,11 @@ void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                                              void *reqarg);
 void vfio_user_send_reply(VFIOProxy *proxy, char *buf, int ret);
 int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
+int vfio_user_dma_map(VFIOProxy *proxy, struct vfio_iommu_type1_dma_map *map,
+                      VFIOUserFDs *fds, bool will_commit);
+int vfio_user_dma_unmap(VFIOProxy *proxy,
+                        struct vfio_iommu_type1_dma_unmap *unmap,
+                        struct vfio_bitmap *bitmap, bool will_commit);
 int vfio_user_get_info(VFIODevice *vbasedev);
 int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
                               struct vfio_region_info *info, VFIOUserFDs *fds);
@@ -80,5 +85,6 @@ int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
                           uint32_t count, void *data);
 int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
                            uint64_t offset, uint32_t count, void *data);
+void vfio_user_drain_reqs(VFIOProxy *proxy);
 
 #endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 688660c28d..13d1d14c3b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -87,6 +87,7 @@ typedef struct VFIOContainer {
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
+    bool will_commit;
     uint64_t dirty_pgsizes;
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 57b9e111e6..a532e52bcf 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -427,6 +427,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
     uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
+    bool will_commit = container->will_commit;
     int ret;
 
     unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
@@ -460,7 +461,11 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
         goto unmap_exit;
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (container->proxy != NULL) {
+        ret = vfio_user_dma_unmap(container->proxy, unmap, bitmap, will_commit);
+    } else {
+        ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    }
     if (!ret) {
         cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
                 iotlb->translated_addr, pages);
@@ -487,12 +492,17 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .iova = iova,
         .size = size,
     };
+    bool will_commit = container->will_commit;
 
     if (iotlb && container->dirty_pages_supported &&
         vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
+    if (container->proxy != NULL) {
+        return vfio_user_dma_unmap(container->proxy, &unmap, NULL, will_commit);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -519,7 +529,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
     return 0;
 }
 
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+static int vfio_dma_map(VFIOContainer *container, MemoryRegion *mr, hwaddr iova,
                         ram_addr_t size, void *vaddr, bool readonly)
 {
     struct vfio_iommu_type1_dma_map map = {
@@ -529,11 +539,30 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .iova = iova,
         .size = size,
     };
+    bool will_commit = container->will_commit;
 
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
 
+    if (container->proxy != NULL) {
+        VFIOUserFDs fds;
+        int fd;
+
+        fd = memory_region_get_fd(mr);
+        if (fd != -1 && !(container->proxy->flags & VFIO_PROXY_SECURE)) {
+            fds.send_fds = 1;
+            fds.recv_fds = 0;
+            fds.fds = &fd;
+            map.vaddr = qemu_ram_block_host_offset(mr->ram_block, vaddr);
+
+            return vfio_user_dma_map(container->proxy, &map, &fds, will_commit);
+        } else {
+            map.vaddr = 0;
+            return vfio_user_dma_map(container->proxy, &map, NULL, will_commit);
+        }
+    }
+
     /*
      * Try the mapping, if it fails with EBUSY, unmap the region and try
      * again.  This shouldn't be necessary, but we sometimes see it in
@@ -602,7 +631,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 
 /* Called with rcu_read_lock held.  */
 static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                               ram_addr_t *ram_addr, bool *read_only)
+                               ram_addr_t *ram_addr, bool *read_only,
+                               MemoryRegion **mrp)
 {
     MemoryRegion *mr;
     hwaddr xlat;
@@ -683,6 +713,10 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
         *read_only = !writable || mr->readonly;
     }
 
+    if (mrp != NULL) {
+        *mrp = mr;
+    }
+
     return true;
 }
 
@@ -690,6 +724,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
+    MemoryRegion *mr;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
     void *vaddr;
     int ret;
@@ -708,7 +743,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         bool read_only;
 
-        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &mr)) {
             goto out;
         }
         /*
@@ -718,7 +753,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          * of vaddr will always be there, even if the memory object is
          * destroyed and its backing memory munmap-ed.
          */
-        ret = vfio_dma_map(container, iova,
+        ret = vfio_dma_map(container, mr, iova,
                            iotlb->addr_mask + 1, vaddr,
                            read_only);
         if (ret) {
@@ -780,7 +815,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
                section->offset_within_address_space;
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
-        ret = vfio_dma_map(vrdl->container, iova, next - start,
+        ret = vfio_dma_map(vrdl->container, section->mr, iova, next - start,
                            vaddr, section->readonly);
         if (ret) {
             /* Rollback */
@@ -888,6 +923,24 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
     g_free(vrdl);
 }
 
+static void vfio_listener_begin(MemoryListener *listener)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+    container->will_commit = 1;
+}
+
+static void vfio_listener_commit(MemoryListener *listener)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+    /* wait for any async requests sent during the transaction */
+    if (container->proxy != NULL) {
+        vfio_user_drain_reqs(container->proxy);
+    }
+    container->will_commit = 0;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -1080,7 +1133,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
+    ret = vfio_dma_map(container, section->mr, iova, int128_get64(llsize),
                        vaddr, section->readonly);
     if (ret) {
         error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
@@ -1346,7 +1399,7 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 
     rcu_read_lock();
-    if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+    if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, NULL)) {
         int ret;
 
         ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
@@ -1463,6 +1516,8 @@ static void vfio_listener_log_sync(MemoryListener *listener,
 }
 
 static const MemoryListener vfio_memory_listener = {
+    .begin = vfio_listener_begin,
+    .commit = vfio_listener_commit,
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
     .log_global_start = vfio_listener_log_global_start,
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index b68ca1279d..06bcd46e60 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -408,6 +408,47 @@ static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
     }
 }
 
+void vfio_user_drain_reqs(VFIOProxy *proxy)
+{
+    VFIOUserReply *reply;
+    bool iolock = 0;
+
+    /*
+     * Any DMA map/unmap requests sent in the middle
+     * of a memory region transaction were sent async.
+     * Wait for them here.
+     */
+    QEMU_LOCK_GUARD(&proxy->lock);
+    if (proxy->last_nowait != NULL) {
+        iolock = qemu_mutex_iothread_locked();
+        if (iolock) {
+            qemu_mutex_unlock_iothread();
+        }
+
+        reply = proxy->last_nowait;
+        reply->nowait = 0;
+        while (reply->complete == 0) {
+            if (!qemu_cond_timedwait(&reply->cv, &proxy->lock, wait_time)) {
+                error_printf("vfio_drain_reqs - timed out\n");
+                break;
+            }
+        }
+
+        if (reply->msg->flags & VFIO_USER_ERROR) {
+            error_printf("vfio_user_rcv error reply on async request ");
+            error_printf("command %x error %s\n", reply->msg->command,
+                         strerror(reply->msg->error_reply));
+        }
+        proxy->last_nowait = NULL;
+        g_free(reply->msg);
+        QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
+    }
+
+    if (iolock) {
+        qemu_mutex_lock_iothread();
+    }
+}
+
 static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
                                   uint32_t size, uint32_t flags)
 {
@@ -715,6 +756,89 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
     return 0;
 }
 
+int vfio_user_dma_map(VFIOProxy *proxy, struct vfio_iommu_type1_dma_map *map,
+                      VFIOUserFDs *fds, bool will_commit)
+{
+    VFIOUserDMAMap *msgp = g_malloc(sizeof(*msgp));
+    int ret, flags;
+
+    /* commit will wait, so send async without dropping BQL */
+    flags = will_commit ? (NOIOLOCK | NOWAIT) : 0;
+
+    vfio_user_request_msg(&msgp->hdr, VFIO_USER_DMA_MAP, sizeof(*msgp), 0);
+    msgp->argsz = map->argsz;
+    msgp->flags = map->flags;
+    msgp->offset = map->vaddr;
+    msgp->iova = map->iova;
+    msgp->size = map->size;
+
+    vfio_user_send_recv(proxy, &msgp->hdr, fds, 0, flags);
+    ret = (msgp->hdr.flags & VFIO_USER_ERROR) ? -msgp->hdr.error_reply : 0;
+
+    if (!(flags & NOWAIT)) {
+        g_free(msgp);
+    }
+    return ret;
+}
+
+int vfio_user_dma_unmap(VFIOProxy *proxy,
+                        struct vfio_iommu_type1_dma_unmap *unmap,
+                        struct vfio_bitmap *bitmap, bool will_commit)
+{
+    struct {
+        VFIOUserDMAUnmap msg;
+        VFIOUserBitmap bitmap;
+    } *msgp = NULL;
+    int msize, rsize, flags;
+
+    if (bitmap == NULL && (unmap->flags &
+                           VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)) {
+        error_printf("vfio_user_dma_unmap mismatched flags and bitmap\n");
+        return -EINVAL;
+    }
+
+    /* can't drop BQL until commit */
+    flags = will_commit ? NOIOLOCK : 0;
+
+    /*
+     * If a dirty bitmap is returned, allocate extra space for it
+     * otherwise, just send the unmap request
+     */
+    if (bitmap != NULL) {
+        msize = sizeof(*msgp);
+        rsize = msize + bitmap->size;
+        msgp = g_malloc0(rsize);
+        msgp->bitmap.pgsize = bitmap->pgsize;
+        msgp->bitmap.size = bitmap->size;
+    } else {
+        /* can only send async if no bitmap returned */
+        flags |= will_commit ? NOWAIT : 0;
+        msize = rsize = sizeof(VFIOUserDMAUnmap);
+        msgp = g_malloc0(rsize);
+    }
+
+    vfio_user_request_msg(&msgp->msg.hdr, VFIO_USER_DMA_UNMAP, msize, flags);
+    msgp->msg.argsz = unmap->argsz;
+    msgp->msg.flags = unmap->flags;
+    msgp->msg.iova = unmap->iova;
+    msgp->msg.size = unmap->size;
+
+    vfio_user_send_recv(proxy, &msgp->msg.hdr, NULL, rsize, flags);
+    if (msgp->msg.hdr.flags & VFIO_USER_ERROR) {
+        g_free(msgp);
+        return -msgp->msg.hdr.error_reply;
+    }
+
+    if (bitmap != NULL) {
+        memcpy(bitmap->data, &msgp->bitmap.data, bitmap->size);
+    }
+    if (!(flags & NOWAIT)) {
+        g_free(msgp);
+    }
+
+    return 0;
+}
+
 int vfio_user_get_info(VFIODevice *vbasedev)
 {
     VFIOUserDeviceInfo msg;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 14/16] vfio-user: dma read/write operations
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (12 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 13/16] vfio-user: dma map/unmap operations Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-08  9:51   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 15/16] vfio-user: pci reset Elena Ufimtseva
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h | 11 +++++++
 hw/vfio/user.h          |  1 +
 hw/vfio/pci.c           | 63 ++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/user.c          |  7 ++++-
 4 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index ca53fce5f4..c5d9473f8f 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -171,6 +171,17 @@ typedef struct {
     char data[];
 } VFIOUserRegionRW;
 
+/*
+ * VFIO_USER_DMA_READ
+ * VFIO_USER_DMA_WRITE
+ */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint64_t offset;
+    uint32_t count;
+    char data[];
+} VFIOUserDMARW;
+
 /*imported from struct vfio_bitmap */
 typedef struct {
     uint64_t pgsize;
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 7786ab57c5..32e8b70d28 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -65,6 +65,7 @@ typedef struct VFIOProxy {
 
 VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
 void vfio_user_disconnect(VFIOProxy *proxy);
+uint64_t vfio_user_max_xfer(void);
 void vfio_user_set_reqhandler(VFIODevice *vbasdev,
                               int (*handler)(void *opaque, char *buf,
                                              VFIOUserFDs *fds),
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2c9fcb2fa9..29a874c066 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3406,11 +3406,72 @@ type_init(register_vfio_pci_dev_type)
  * vfio-user routines.
  */
 
-static int vfio_user_pci_process_req(void *opaque, char *buf, VFIOUserFDs *fds)
+static int vfio_user_dma_read(VFIOPCIDevice *vdev, VFIOUserDMARW *msg)
 {
+    PCIDevice *pdev = &vdev->pdev;
+    char *buf;
+    int size = msg->count + sizeof(VFIOUserDMARW);
+
+    if (msg->hdr.flags & VFIO_USER_NO_REPLY) {
+        return -EINVAL;
+    }
+    if (msg->count > vfio_user_max_xfer()) {
+        return -E2BIG;
+    }
+
+    buf = g_malloc0(size);
+    memcpy(buf, msg, sizeof(*msg));
+
+    pci_dma_read(pdev, msg->offset, buf + sizeof(*msg), msg->count);
+
+    vfio_user_send_reply(vdev->vbasedev.proxy, buf, size);
+    g_free(buf);
     return 0;
 }
 
+static int vfio_user_dma_write(VFIOPCIDevice *vdev,
+                               VFIOUserDMARW *msg)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    char *buf = (char *)msg + sizeof(*msg);
+
+    /* make sure transfer count isn't larger than the message data */
+    if (msg->count > msg->hdr.size - sizeof(*msg)) {
+        return -E2BIG;
+    }
+
+    pci_dma_write(pdev, msg->offset, buf, msg->count);
+
+    if ((msg->hdr.flags & VFIO_USER_NO_REPLY) == 0) {
+        vfio_user_send_reply(vdev->vbasedev.proxy, (char *)msg,
+                             sizeof(msg->hdr));
+    }
+    return 0;
+}
+
+static int vfio_user_pci_process_req(void *opaque, char *buf, VFIOUserFDs *fds)
+{
+    VFIOPCIDevice *vdev = opaque;
+    VFIOUserHdr *hdr = (VFIOUserHdr *)buf;
+    int ret;
+
+    if (fds->recv_fds != 0) {
+        return -EINVAL;
+    }
+    switch (hdr->command) {
+    case VFIO_USER_DMA_READ:
+        ret = vfio_user_dma_read(vdev, (VFIOUserDMARW *)hdr);
+        break;
+    case VFIO_USER_DMA_WRITE:
+        ret = vfio_user_dma_write(vdev, (VFIOUserDMARW *)hdr);
+        break;
+    default:
+        error_printf("vfio_user_process_req unknown cmd %d\n", hdr->command);
+        ret = -ENOSYS;
+    }
+    return ret;
+}
+
 /*
  * Emulated devices don't use host hot reset
  */
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 06bcd46e60..fcc041959c 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -54,6 +54,11 @@ static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
  * Functions called by main, CPU, or iothread threads
  */
 
+uint64_t vfio_user_max_xfer(void)
+{
+    return max_xfer_size;
+}
+
 static void vfio_user_shutdown(VFIOProxy *proxy)
 {
     qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
@@ -251,7 +256,7 @@ void vfio_user_recv(void *opaque)
         *reply->msg = msg;
         data = (char *)reply->msg + sizeof(msg);
     } else {
-        if (msg.size > max_xfer_size) {
+        if (msg.size > max_xfer_size + sizeof(VFIOUserDMARW)) {
             error_setg(&local_err, "vfio_user_recv request larger than max");
             goto fatal;
         }
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 15/16] vfio-user: pci reset
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (13 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 14/16] vfio-user: dma read/write operations Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-08  9:56   ` Stefan Hajnoczi
  2021-08-16 16:42 ` [PATCH RFC v2 16/16] vfio-user: migration support Elena Ufimtseva
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user.h |  1 +
 hw/vfio/pci.c  | 29 ++++++++++++++++++++++++++---
 hw/vfio/user.c | 12 ++++++++++++
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 32e8b70d28..5d4d0a43ba 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -86,6 +86,7 @@ int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
                           uint32_t count, void *data);
 int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
                            uint64_t offset, uint32_t count, void *data);
+void vfio_user_reset(VFIODevice *vbasedev);
 void vfio_user_drain_reqs(VFIOProxy *proxy);
 
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29a874c066..4b933ed10f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2229,8 +2229,9 @@ static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
 
 static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
 {
+    VFIODevice *vbasedev = &vdev->vbasedev;
     Error *err = NULL;
-    int nr;
+    int ret, nr;
 
     vfio_intx_enable(vdev, &err);
     if (err) {
@@ -2238,11 +2239,18 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
     }
 
     for (nr = 0; nr < PCI_NUM_REGIONS - 1; ++nr) {
-        off_t addr = vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr);
+        off_t addr = PCI_BASE_ADDRESS_0 + (4 * nr);
         uint32_t val = 0;
         uint32_t len = sizeof(val);
 
-        if (pwrite(vdev->vbasedev.fd, &val, len, addr) != len) {
+        if (vbasedev->proxy != NULL) {
+            ret = vfio_user_region_write(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
+                                         addr, len, &val);
+        } else {
+            ret = pwrite(vdev->vbasedev.fd, &val, len,
+                         vdev->config_offset + addr);
+        }
+        if (ret != len) {
             error_report("%s(%s) reset bar %d failed: %m", __func__,
                          vdev->vbasedev.name, nr);
         }
@@ -3684,6 +3692,20 @@ static void vfio_user_instance_finalize(Object *obj)
     vfio_user_disconnect(vbasedev->proxy);
 }
 
+static void vfio_user_pci_reset(DeviceState *dev)
+{
+    VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    vfio_pci_pre_reset(vdev);
+
+    if (vbasedev->reset_works) {
+        vfio_user_reset(vbasedev);
+    }
+
+    vfio_pci_post_reset(vdev);
+}
+
 static Property vfio_user_pci_dev_properties[] = {
     DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
     DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
@@ -3695,6 +3717,7 @@ static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
     DeviceClass *dc = DEVICE_CLASS(klass);
     PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
 
+    dc->reset = vfio_user_pci_reset;
     device_class_set_props(dc, vfio_user_pci_dev_properties);
     dc->desc = "VFIO over socket PCI device assignment";
     pdc->realize = vfio_user_pci_realize;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index fcc041959c..7de2125346 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1045,3 +1045,15 @@ int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
 
     return count;
 }
+
+void vfio_user_reset(VFIODevice *vbasedev)
+{
+    VFIOUserHdr msg;
+
+    vfio_user_request_msg(&msg, VFIO_USER_DEVICE_RESET, sizeof(msg), 0);
+
+    vfio_user_send_recv(vbasedev->proxy, &msg, NULL, 0, 0);
+    if (msg.flags & VFIO_USER_ERROR) {
+        error_printf("reset reply error %d\n", msg.error_reply);
+    }
+}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC v2 16/16] vfio-user: migration support
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (14 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 15/16] vfio-user: pci reset Elena Ufimtseva
@ 2021-08-16 16:42 ` Elena Ufimtseva
  2021-09-08 10:04   ` Stefan Hajnoczi
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
  16 siblings, 1 reply; 108+ messages in thread
From: Elena Ufimtseva @ 2021-08-16 16:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, jag.raman, swapnil.ingle,
	john.levon, alex.williamson, stefanha, thanos.makatos

From: John Johnson <john.g.johnson@oracle.com>

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/vfio/user-protocol.h | 18 +++++++++++++++++
 hw/vfio/user.h          |  3 +++
 hw/vfio/common.c        | 23 ++++++++++++++++-----
 hw/vfio/migration.c     | 34 +++++++++++++++++--------------
 hw/vfio/pci.c           | 12 +++++++++++
 hw/vfio/user.c          | 45 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 115 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index c5d9473f8f..bad067a570 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -182,6 +182,10 @@ typedef struct {
     char data[];
 } VFIOUserDMARW;
 
+/*
+ * VFIO_USER_DIRTY_PAGES
+ */
+
 /*imported from struct vfio_bitmap */
 typedef struct {
     uint64_t pgsize;
@@ -189,4 +193,18 @@ typedef struct {
     char data[];
 } VFIOUserBitmap;
 
+/* imported from struct vfio_iommu_type1_dirty_bitmap_get */
+typedef struct {
+    uint64_t iova;
+    uint64_t size;
+    VFIOUserBitmap bitmap;
+} VFIOUserBitmapRange;
+
+/* imported from struct vfio_iommu_type1_dirty_bitmap */
+typedef struct {
+    VFIOUserHdr hdr;
+    uint32_t argsz;
+    uint32_t flags;
+} VFIOUserDirtyPages;
+
 #endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 5d4d0a43ba..905e0ee28d 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -87,6 +87,9 @@ int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
 int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
                            uint64_t offset, uint32_t count, void *data);
 void vfio_user_reset(VFIODevice *vbasedev);
+int vfio_user_dirty_bitmap(VFIOProxy *proxy,
+                           struct vfio_iommu_type1_dirty_bitmap *bitmap,
+                           struct vfio_iommu_type1_dirty_bitmap_get *range);
 void vfio_user_drain_reqs(VFIOProxy *proxy);
 
 #endif /* VFIO_USER_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a532e52bcf..09d0147df2 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1303,10 +1303,19 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
         dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
-    if (ret) {
-        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
-                     dirty.flags, errno);
+    if (container->proxy != NULL) {
+        ret = vfio_user_dirty_bitmap(container->proxy, &dirty, NULL);
+        if (ret) {
+            error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                         dirty.flags, -ret);
+        }
+    } else {
+        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+        if (ret) {
+            error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                         dirty.flags, errno);
+            ret = -errno;
+        }
     }
 }
 
@@ -1356,7 +1365,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         goto err_out;
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (container->proxy != NULL) {
+        ret = vfio_user_dirty_bitmap(container->proxy, dbitmap, range);
+    } else {
+        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    }
     if (ret) {
         error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
                 " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 82f654afb6..89926a3b01 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -27,6 +27,7 @@
 #include "pci.h"
 #include "trace.h"
 #include "hw/hw.h"
+#include "user.h"
 
 /*
  * Flags to be used as unique delimiters for VFIO devices in the migration
@@ -49,10 +50,18 @@ static int64_t bytes_transferred;
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
                                   off_t off, bool iswrite)
 {
+    VFIORegion *region = &vbasedev->migration->region;
     int ret;
 
-    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
-                    pread(vbasedev->fd, val, count, off);
+    if (vbasedev->proxy != NULL) {
+        ret = iswrite ?
+            vfio_user_region_write(vbasedev, region->nr, off, count, val) :
+            vfio_user_region_read(vbasedev, region->nr, off, count, val);
+    } else {
+        off += region->fd_offset;
+        ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+                        pread(vbasedev->fd, val, count, off);
+    }
     if (ret < count) {
         error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
                      HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,
@@ -111,9 +120,7 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
                                     uint32_t value)
 {
     VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    off_t dev_state_off = region->fd_offset +
-                          VFIO_MIG_STRUCT_OFFSET(device_state);
+    off_t dev_state_off = VFIO_MIG_STRUCT_OFFSET(device_state);
     uint32_t device_state;
     int ret;
 
@@ -201,13 +208,13 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
     int ret;
 
     ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+                        VFIO_MIG_STRUCT_OFFSET(data_offset));
     if (ret < 0) {
         return ret;
     }
 
     ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+                        VFIO_MIG_STRUCT_OFFSET(data_size));
     if (ret < 0) {
         return ret;
     }
@@ -233,8 +240,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
             }
             buf_allocated = true;
 
-            ret = vfio_mig_read(vbasedev, buf, sec_size,
-                                region->fd_offset + data_offset);
+            ret = vfio_mig_read(vbasedev, buf, sec_size, data_offset);
             if (ret < 0) {
                 g_free(buf);
                 return ret;
@@ -269,7 +275,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
 
     do {
         ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+                            VFIO_MIG_STRUCT_OFFSET(data_offset));
         if (ret < 0) {
             return ret;
         }
@@ -309,8 +315,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
             qemu_get_buffer(f, buf, sec_size);
 
             if (buf_alloc) {
-                ret = vfio_mig_write(vbasedev, buf, sec_size,
-                        region->fd_offset + data_offset);
+                ret = vfio_mig_write(vbasedev, buf, sec_size, data_offset);
                 g_free(buf);
 
                 if (ret < 0) {
@@ -322,7 +327,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
         }
 
         ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+                             VFIO_MIG_STRUCT_OFFSET(data_size));
         if (ret < 0) {
             return ret;
         }
@@ -334,12 +339,11 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
 static int vfio_update_pending(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
     uint64_t pending_bytes = 0;
     int ret;
 
     ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
-                    region->fd_offset + VFIO_MIG_STRUCT_OFFSET(pending_bytes));
+                        VFIO_MIG_STRUCT_OFFSET(pending_bytes));
     if (ret < 0) {
         migration->pending_bytes = 0;
         return ret;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4b933ed10f..976fb89786 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3658,6 +3658,13 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Migration disabled", vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
 
@@ -3709,6 +3716,11 @@ static void vfio_user_pci_reset(DeviceState *dev)
 static Property vfio_user_pci_dev_properties[] = {
     DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
     DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
+    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
+                     vbasedev.enable_migration, false),
+    DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
+                            vbasedev.pre_copy_dirty_page_tracking,
+                            ON_OFF_AUTO_ON),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 7de2125346..486f7c0fe7 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1057,3 +1057,48 @@ void vfio_user_reset(VFIODevice *vbasedev)
         error_printf("reset reply error %d\n", msg.error_reply);
     }
 }
+
+int vfio_user_dirty_bitmap(VFIOProxy *proxy,
+                           struct vfio_iommu_type1_dirty_bitmap *cmd,
+                           struct vfio_iommu_type1_dirty_bitmap_get *dbitmap)
+{
+    g_autofree struct {
+        VFIOUserDirtyPages msg;
+        VFIOUserBitmapRange range;
+    } *msgp = NULL;
+    int msize, rsize;
+
+    /*
+     * If just the command is sent, the returned bitmap isn't needed.
+     * The bitmap structs are different from the ioctl() versions,
+     * ioctl() returns the bitmap in a local VA
+     */
+    if (dbitmap != NULL) {
+        msize = sizeof(*msgp);
+        rsize = msize + dbitmap->bitmap.size;
+        msgp = g_malloc0(rsize);
+        msgp->range.iova = dbitmap->iova;
+        msgp->range.size = dbitmap->size;
+        msgp->range.bitmap.pgsize = dbitmap->bitmap.pgsize;
+        msgp->range.bitmap.size = dbitmap->bitmap.size;
+    } else {
+        msize = rsize = sizeof(VFIOUserDirtyPages);
+        msgp = g_malloc0(rsize);
+    }
+
+    vfio_user_request_msg(&msgp->msg.hdr, VFIO_USER_DIRTY_PAGES, msize, 0);
+    msgp->msg.argsz = msize - sizeof(msgp->msg.hdr);
+    msgp->msg.flags = cmd->flags;
+
+    vfio_user_send_recv(proxy, &msgp->msg.hdr, NULL, rsize, 0);
+    if (msgp->msg.hdr.flags & VFIO_USER_ERROR) {
+        return -msgp->msg.hdr.error_reply;
+    }
+
+    if (dbitmap != NULL) {
+        memcpy(dbitmap->bitmap.data, &msgp->range.bitmap.data,
+               dbitmap->bitmap.size);
+    }
+
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification
  2021-08-16 16:42 ` [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
@ 2021-08-17 23:04   ` Alex Williamson
  2021-08-19  9:28     ` Swapnil Ingle
  2021-08-19 15:32     ` John Johnson
  0 siblings, 2 replies; 108+ messages in thread
From: Alex Williamson @ 2021-08-17 23:04 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	stefanha, thanos.makatos

On Mon, 16 Aug 2021 09:42:34 -0700
Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> +Authentication
> +--------------
> +
> +For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
> +therefore it is up to the management layer to set up the socket as required.
> +Socket types than span guests or hosts will require a proper authentication

s/than/that/

...
> +``VFIO_USER_DMA_UNMAP``
> +-----------------------
> +
> +This command message is sent by the client to the server to inform it that a
> +DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
> +message, is no longer available for DMA. It typically occurs when memory is
> +subtracted from the client or if the client uses a vIOMMU. The DMA region is
> +described by the following structure:
> +
> +Request
> +^^^^^^^
> +
> +The request payload for this message is a structure of the following format:
> +
> ++--------------+--------+------------------------+
> +| Name         | Offset | Size                   |
> ++==============+========+========================+
> +| argsz        | 0      | 4                      |
> ++--------------+--------+------------------------+
> +| flags        | 4      | 4                      |
> ++--------------+--------+------------------------+
> +|              | +-----+-----------------------+ |
> +|              | | Bit | Definition            | |
> +|              | +=====+=======================+ |
> +|              | | 0   | get dirty page bitmap | |
> +|              | +-----+-----------------------+ |
> ++--------------+--------+------------------------+
> +| address      | 8      | 8                      |
> ++--------------+--------+------------------------+
> +| size         | 16     | 8                      |
> ++--------------+--------+------------------------+
> +
> +* *argsz* is the maximum size of the reply payload.
> +* *flags* contains the following DMA region attributes:
> +
> +  * *get dirty page bitmap* indicates that a dirty page bitmap must be
> +    populated before unmapping the DMA region. The client must provide a
> +    `VFIO Bitmap`_ structure, explained below, immediately following this
> +    entry.
> +
> +* *address* is the base DMA address of the DMA region.
> +* *size* is the size of the DMA region.
> +
> +The address and size of the DMA region being unmapped must match exactly a
> +previous mapping. The size of request message depends on whether or not the
> +*get dirty page bitmap* bit is set in Flags:
> +
> +* If not set, the size of the total request message is: 16 + 24.
> +
> +* If set, the size of the total request message is: 16 + 24 + 16.

The address/size paradigm falls into the same issues as the vfio kernel
interface where we can't map or unmap the entire 64-bit address space,
ie. size is limited to 2^64 - 1.  The kernel interface also requires
PAGE_SIZE granularity for the DMA, which means the practical limit is
2^64 - PAGE_SIZE.  If we had a redo on the kernel interface we'd use
start/end so we can express a size of (end - start + 1).

Is following the vfio kernel interface sufficiently worthwhile for
compatibility to incur this same limitation?  I don't recall if we've
already discussed this, but perhaps worth a note in this design doc if
similarity to the kernel interface is being favored here.  See for
example QEMU commit 1b296c3def4b ("vfio: Don't issue full 2^64 unmap").
Thanks,

Alex



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-08-16 16:42 ` [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
@ 2021-08-18 18:47   ` Alex Williamson
  2021-08-19 14:10     ` John Johnson
  2021-08-24 14:15   ` Stefan Hajnoczi
  1 sibling, 1 reply; 108+ messages in thread
From: Alex Williamson @ 2021-08-18 18:47 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	stefanha, thanos.makatos

On Mon, 16 Aug 2021 09:42:37 -0700
Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:

> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af297a0..739b30be73 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -8,6 +8,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> +  'user.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
>  vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))

Wouldn't it make sense to be able to configure QEMU with any
combination of vfio-pci and/or vfio-user-pci support rather than
statically tying vfio-user-pci to vfio-pci?  Not to mention that doing
so would help to more formally define the interface operations between
kernel and user options, for example fewer tests of vbasedev->proxy and
perhaps more abstraction through ops structures.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification
  2021-08-17 23:04   ` Alex Williamson
@ 2021-08-19  9:28     ` Swapnil Ingle
  2021-08-19 15:32     ` John Johnson
  1 sibling, 0 replies; 108+ messages in thread
From: Swapnil Ingle @ 2021-08-19  9:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, john.g.johnson, jag.raman, John Levon,
	qemu-devel, stefanha, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 3726 bytes --]



On 18. Aug 2021, at 01:04, Alex Williamson <alex.williamson@redhat.com<mailto:alex.williamson@redhat.com>> wrote:

On Mon, 16 Aug 2021 09:42:34 -0700
Elena Ufimtseva <elena.ufimtseva@oracle.com<mailto:elena.ufimtseva@oracle.com>> wrote:
+Authentication
+--------------
+
+For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types than span guests or hosts will require a proper authentication

s/than/that/
Ack

...
+``VFIO_USER_DMA_UNMAP``
+-----------------------
+
+This command message is sent by the client to the server to inform it that a
+DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
+message, is no longer available for DMA. It typically occurs when memory is
+subtracted from the client or if the client uses a vIOMMU. The DMA region is
+described by the following structure:
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++--------------+--------+------------------------+
+| Name         | Offset | Size                   |
++==============+========+========================+
+| argsz        | 0      | 4                      |
++--------------+--------+------------------------+
+| flags        | 4      | 4                      |
++--------------+--------+------------------------+
+|              | +-----+-----------------------+ |
+|              | | Bit | Definition            | |
+|              | +=====+=======================+ |
+|              | | 0   | get dirty page bitmap | |
+|              | +-----+-----------------------+ |
++--------------+--------+------------------------+
+| address      | 8      | 8                      |
++--------------+--------+------------------------+
+| size         | 16     | 8                      |
++--------------+--------+------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *flags* contains the following DMA region attributes:
+
+  * *get dirty page bitmap* indicates that a dirty page bitmap must be
+    populated before unmapping the DMA region. The client must provide a
+    `VFIO Bitmap`_ structure, explained below, immediately following this
+    entry.
+
+* *address* is the base DMA address of the DMA region.
+* *size* is the size of the DMA region.
+
+The address and size of the DMA region being unmapped must match exactly a
+previous mapping. The size of request message depends on whether or not the
+*get dirty page bitmap* bit is set in Flags:
+
+* If not set, the size of the total request message is: 16 + 24.
+
+* If set, the size of the total request message is: 16 + 24 + 16.

The address/size paradigm falls into the same issues as the vfio kernel
interface where we can't map or unmap the entire 64-bit address space,
ie. size is limited to 2^64 - 1.  The kernel interface also requires
PAGE_SIZE granularity for the DMA, which means the practical limit is
2^64 - PAGE_SIZE.  If we had a redo on the kernel interface we'd use
start/end so we can express a size of (end - start + 1).

Is following the vfio kernel interface sufficiently worthwhile for
compatibility to incur this same limitation?  I don't recall if we've
already discussed this, but perhaps worth a note in this design doc if
similarity to the kernel interface is being favored here.  See for
example QEMU commit 1b296c3def4b ("vfio: Don't issue full 2^64 unmap”).
I cannot think of any reason that we need to have this limitation in vfio-user.
Opened https://github.com/nutanix/libvfio-user/issues/593 to address this issue.

Thanks,
-Swapnil



[-- Attachment #2: Type: text/html, Size: 6920 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-08-18 18:47   ` Alex Williamson
@ 2021-08-19 14:10     ` John Johnson
  0 siblings, 0 replies; 108+ messages in thread
From: John Johnson @ 2021-08-19 14:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Stefan Hajnoczi, thanos.makatos



> On Aug 18, 2021, at 2:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Mon, 16 Aug 2021 09:42:37 -0700
> Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> 
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af297a0..739b30be73 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -8,6 +8,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>>   'display.c',
>>   'pci-quirks.c',
>>   'pci.c',
>> +  'user.c',
>> ))
>> vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
>> vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))
> 
> Wouldn't it make sense to be able to configure QEMU with any
> combination of vfio-pci and/or vfio-user-pci support rather than
> statically tying vfio-user-pci to vfio-pci?  Not to mention that doing
> so would help to more formally define the interface operations between
> kernel and user options, for example fewer tests of vbasedev->proxy and
> perhaps more abstraction through ops structures.  Thanks,
> 

	We can certainly add another config option for vfio-user.

	As far as an ops vector vs vbasedev->proxy tests, it’s a
matter personal preference.  I prefer the changes inline when they
are this small, but we can make a vector if that’s what you want.

							JJ


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification
  2021-08-17 23:04   ` Alex Williamson
  2021-08-19  9:28     ` Swapnil Ingle
@ 2021-08-19 15:32     ` John Johnson
  2021-08-19 16:26       ` Alex Williamson
  1 sibling, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-08-19 15:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Stefan Hajnoczi, thanos.makatos



> On Aug 17, 2021, at 7:04 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> 
> The address/size paradigm falls into the same issues as the vfio kernel
> interface where we can't map or unmap the entire 64-bit address space,
> ie. size is limited to 2^64 - 1.  The kernel interface also requires
> PAGE_SIZE granularity for the DMA, which means the practical limit is
> 2^64 - PAGE_SIZE.  If we had a redo on the kernel interface we'd use
> start/end so we can express a size of (end - start + 1).
> 
> Is following the vfio kernel interface sufficiently worthwhile for
> compatibility to incur this same limitation?  I don't recall if we've
> already discussed this, but perhaps worth a note in this design doc if
> similarity to the kernel interface is being favored here.  See for
> example QEMU commit 1b296c3def4b ("vfio: Don't issue full 2^64 unmap").
> Thanks,
> 


	I’d prefer to stay as close to the kernel i/f as we can.
An earlier version of the spec used a vhost-user derived structure
for MAP & UNMAP.  This made it more difficult to add the bitmap
field when vfio added migration capability, so we switched to the
ioctl() structure.

	It looks like vfio_dma_unmap() takes a 64b ‘size’ arg
(ram_addr_t)  How did you unmap an entire 64b address space?  The
comment there mentions a bug where iova+size wraps the end of the
64b space.

							JJ

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification
  2021-08-19 15:32     ` John Johnson
@ 2021-08-19 16:26       ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2021-08-19 16:26 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Stefan Hajnoczi, thanos.makatos

On Thu, 19 Aug 2021 15:32:16 +0000
John Johnson <john.g.johnson@oracle.com> wrote:

> > On Aug 17, 2021, at 7:04 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > 
> > The address/size paradigm falls into the same issues as the vfio kernel
> > interface where we can't map or unmap the entire 64-bit address space,
> > ie. size is limited to 2^64 - 1.  The kernel interface also requires
> > PAGE_SIZE granularity for the DMA, which means the practical limit is
> > 2^64 - PAGE_SIZE.  If we had a redo on the kernel interface we'd use
> > start/end so we can express a size of (end - start + 1).
> > 
> > Is following the vfio kernel interface sufficiently worthwhile for
> > compatibility to incur this same limitation?  I don't recall if we've
> > already discussed this, but perhaps worth a note in this design doc if
> > similarity to the kernel interface is being favored here.  See for
> > example QEMU commit 1b296c3def4b ("vfio: Don't issue full 2^64 unmap").
> > Thanks,
> >   
> 
> 
> 	I’d prefer to stay as close to the kernel i/f as we can.
> An earlier version of the spec used a vhost-user derived structure
> for MAP & UNMAP.  This made it more difficult to add the bitmap
> field when vfio added migration capability, so we switched to the
> ioctl() structure.
> 
> 	It looks like vfio_dma_unmap() takes a 64b ‘size’ arg
> (ram_addr_t)  How did you unmap an entire 64b address space?

It's called from the MemoryListener which operates on
MemoryRegionSections, which uses Int128 that get's chunked to
ram_addr_t for vfio_dma_unmap().  We do now have
VFIO_DMA_UNMAP_FLAG_ALL in the kernel API which gives us an option to
clear the whole 64bit address space in one ioctl, but it's not a high
priority to make use of in QEMU since it still needs to handle older
kernels.

> The comment there mentions a bug where iova+size wraps the end of the
> 64b space.

Right, that's a separate issue that's just a bug in the kernel.  That's
been fixed, but the QEMU code exists for now as a workaround for any
broken kernels in the wild.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 03/16] vfio-user: Define type vfio_user_pci_dev_info
  2021-08-16 16:42 ` [PATCH RFC v2 03/16] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
@ 2021-08-24 13:52   ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 13:52 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 435 bytes --]

On Mon, Aug 16, 2021 at 09:42:36AM -0700, Elena Ufimtseva wrote:
> +static Property vfio_user_pci_dev_properties[] = {
> +    DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
> +    DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
> +    DEFINE_PROP_END_OF_LIST(),
> +};

Are we missing out on properties that could be common for all VFIO PCI
devices like x-pci-vendor-id, x-pci-device-id, etc?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-08-16 16:42 ` [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
  2021-08-18 18:47   ` Alex Williamson
@ 2021-08-24 14:15   ` Stefan Hajnoczi
  2021-08-30  3:00     ` John Johnson
  1 sibling, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 14:15 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 4065 bytes --]

On Mon, Aug 16, 2021 at 09:42:37AM -0700, Elena Ufimtseva wrote:
> @@ -3361,13 +3362,35 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>      VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
>      VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>      VFIODevice *vbasedev = &vdev->vbasedev;
> +    SocketAddress addr;
> +    VFIOProxy *proxy;
> +    Error *err = NULL;
>  
> +    /*
> +     * TODO: make option parser understand SocketAddress
> +     * and use that instead of having scaler options

s/scaler/scalar/

> +VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
> +{
> +    VFIOProxy *proxy;
> +    QIOChannelSocket *sioc;
> +    QIOChannel *ioc;
> +    char *sockname;
> +
> +    if (addr->type != SOCKET_ADDRESS_TYPE_UNIX) {
> +        error_setg(errp, "vfio_user_connect - bad address family");
> +        return NULL;
> +    }
> +    sockname = addr->u.q_unix.path;
> +
> +    sioc = qio_channel_socket_new();
> +    ioc = QIO_CHANNEL(sioc);
> +    if (qio_channel_socket_connect_sync(sioc, addr, errp)) {
> +        object_unref(OBJECT(ioc));
> +        return NULL;
> +    }
> +    qio_channel_set_blocking(ioc, true, NULL);
> +
> +    proxy = g_malloc0(sizeof(VFIOProxy));
> +    proxy->sockname = sockname;

sockname is addr->u.q_unix.path, so there's an assumption that the
lifetime of the addr argument is at least as long as the proxy object's
lifetime. This doesn't seem to be the case in vfio_user_pci_realize()
since the SocketAddress variable is declared on the stack.

I suggest making SocketAddress *addr const so it's obvious that this
function just reads it (doesn't take ownership of the pointer) and
copying the UNIX domain socket path with g_strdup() to avoid the
dangling pointer.

> +    proxy->ioc = ioc;
> +    proxy->flags = VFIO_PROXY_CLIENT;
> +    proxy->state = VFIO_PROXY_CONNECTED;
> +    qemu_cond_init(&proxy->close_cv);
> +
> +    if (vfio_user_iothread == NULL) {
> +        vfio_user_iothread = iothread_create("VFIO user", errp);
> +    }

Why is a dedicated IOThread needed for VFIO user?

> +
> +    qemu_mutex_init(&proxy->lock);
> +    QTAILQ_INIT(&proxy->free);
> +    QTAILQ_INIT(&proxy->pending);
> +    QLIST_INSERT_HEAD(&vfio_user_sockets, proxy, next);
> +
> +    return proxy;
> +}
> +

/* Called with the BQL */
> +void vfio_user_disconnect(VFIOProxy *proxy)
> +{
> +    VFIOUserReply *r1, *r2;
> +
> +    qemu_mutex_lock(&proxy->lock);
> +
> +    /* our side is quitting */
> +    if (proxy->state == VFIO_PROXY_CONNECTED) {
> +        vfio_user_shutdown(proxy);
> +        if (!QTAILQ_EMPTY(&proxy->pending)) {
> +            error_printf("vfio_user_disconnect: outstanding requests\n");
> +        }
> +    }
> +    object_unref(OBJECT(proxy->ioc));
> +    proxy->ioc = NULL;
> +
> +    proxy->state = VFIO_PROXY_CLOSING;
> +    QTAILQ_FOREACH_SAFE(r1, &proxy->pending, next, r2) {
> +        qemu_cond_destroy(&r1->cv);
> +        QTAILQ_REMOVE(&proxy->pending, r1, next);
> +        g_free(r1);
> +    }
> +    QTAILQ_FOREACH_SAFE(r1, &proxy->free, next, r2) {
> +        qemu_cond_destroy(&r1->cv);
> +        QTAILQ_REMOVE(&proxy->free, r1, next);
> +        g_free(r1);
> +    }
> +
> +    /*
> +     * Make sure the iothread isn't blocking anywhere
> +     * with a ref to this proxy by waiting for a BH
> +     * handler to run after the proxy fd handlers were
> +     * deleted above.
> +     */
> +    proxy->close_wait = 1;

Please use true. '1' is shorter but it's less obvious to the reader (I
had to jump to the definition to check whether this field was bool or
int).

> +    aio_bh_schedule_oneshot(iothread_get_aio_context(vfio_user_iothread),
> +                            vfio_user_cb, proxy);
> +
> +    /* drop locks so the iothread can make progress */
> +    qemu_mutex_unlock_iothread();

Why does the BQL needs to be dropped so vfio_user_iothread can make
progress?

> +    qemu_cond_wait(&proxy->close_cv, &proxy->lock);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions
  2021-08-16 16:42 ` [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
@ 2021-08-24 15:14   ` Stefan Hajnoczi
  2021-08-30  3:04     ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 15:14 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 4790 bytes --]

On Mon, Aug 16, 2021 at 09:42:38AM -0700, Elena Ufimtseva wrote:
> @@ -62,5 +65,10 @@ typedef struct VFIOProxy {
>  
>  VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
>  void vfio_user_disconnect(VFIOProxy *proxy);
> +void vfio_user_set_reqhandler(VFIODevice *vbasdev,

"vbasedev" for consistency?

> +                              int (*handler)(void *opaque, char *buf,
> +                                             VFIOUserFDs *fds),
> +                                             void *reqarg);

The handler callback is undocumented. What context does it run in, what
do the arguments mean, and what should the function return? Please
document it so it's easy for others to modify this code in the future
without reverse-engineering the assumptions behind it.

> +void vfio_user_recv(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOProxy *proxy = vbasedev->proxy;
> +    VFIOUserReply *reply = NULL;
> +    g_autofree int *fdp = NULL;
> +    VFIOUserFDs reqfds = { 0, 0, fdp };
> +    VFIOUserHdr msg;
> +    struct iovec iov = {
> +        .iov_base = &msg,
> +        .iov_len = sizeof(msg),
> +    };
> +    bool isreply;
> +    int i, ret;
> +    size_t msgleft, numfds = 0;
> +    char *data = NULL;
> +    g_autofree char *buf = NULL;
> +    Error *local_err = NULL;
> +
> +    qemu_mutex_lock(&proxy->lock);
> +    if (proxy->state == VFIO_PROXY_CLOSING) {
> +        qemu_mutex_unlock(&proxy->lock);
> +        return;
> +    }
> +
> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
> +                                 &local_err);

This is a blocking call. My understanding is that the IOThread is shared
by all vfio-user devices, so other devices will have to wait if one of
them is acting up (e.g. the device emulation process sent less than
sizeof(msg) bytes).

While we're blocked in this function the proxy device cannot be
hot-removed since proxy->lock is held.

It would more robust to use of the event loop to avoid blocking. There
could be a per-connection receiver coroutine that calls
qio_channel_readv_full_all_eof() (it yields the coroutine if reading
would block).

> +    /*
> +     * Replies signal a waiter, requests get processed by vfio code
> +     * that may assume the iothread lock is held.
> +     */
> +    if (isreply) {
> +        reply->complete = 1;
> +        if (!reply->nowait) {
> +            qemu_cond_signal(&reply->cv);
> +        } else {
> +            if (msg.flags & VFIO_USER_ERROR) {
> +                error_printf("vfio_user_rcv error reply on async request ");
> +                error_printf("command %x error %s\n", msg.command,
> +                             strerror(msg.error_reply));
> +            }
> +            /* just free it if no one is waiting */
> +            reply->nowait = 0;
> +            if (proxy->last_nowait == reply) {
> +                proxy->last_nowait = NULL;
> +            }
> +            g_free(reply->msg);
> +            QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
> +        }
> +        qemu_mutex_unlock(&proxy->lock);
> +    } else {
> +        qemu_mutex_unlock(&proxy->lock);
> +        qemu_mutex_lock_iothread();

The fact that proxy->request() runs with the BQL suggests that VFIO
communication should take place in the main event loop thread instead of
a separate IOThread.

> +        /*
> +         * make sure proxy wasn't closed while we waited
> +         * checking state without holding the proxy lock is safe
> +         * since it's only set to CLOSING when BQL is held
> +         */
> +        if (proxy->state != VFIO_PROXY_CLOSING) {
> +            ret = proxy->request(proxy->reqarg, buf, &reqfds);

The request() callback in an earlier patch is a noop for the client
implementation. Who frees passed fds?

> +            if (ret < 0 && !(msg.flags & VFIO_USER_NO_REPLY)) {
> +                vfio_user_send_reply(proxy, buf, ret);
> +            }
> +        }
> +        qemu_mutex_unlock_iothread();
> +    }
> +    return;
> +
> +fatal:
> +    vfio_user_shutdown(proxy);
> +    proxy->state = VFIO_PROXY_RECV_ERROR;
> +
> +err:
> +    for (i = 0; i < numfds; i++) {
> +        close(fdp[i]);
> +    }
> +    if (reply != NULL) {
> +        /* force an error to keep sending thread from hanging */
> +        reply->msg->flags |= VFIO_USER_ERROR;
> +        reply->msg->error_reply = EINVAL;
> +        reply->complete = 1;
> +        qemu_cond_signal(&reply->cv);

What about fd passing? The actual fds have been closed already in fdp[]
but reply has a copy too.

What about the nowait case? If no one is waiting on reply->cv so this
reply will be leaked?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server
  2021-08-16 16:42 ` [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server Elena Ufimtseva
@ 2021-08-24 15:59   ` Stefan Hajnoczi
  2021-08-30  3:08     ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 15:59 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 4808 bytes --]

On Mon, Aug 16, 2021 at 09:42:39AM -0700, Elena Ufimtseva wrote:
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7005d9f891..eae33e746f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3397,6 +3397,12 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>          proxy->flags |= VFIO_PROXY_SECURE;
>      }
>  
> +    vfio_user_validate_version(vbasedev, &err);
> +    if (err != NULL) {
> +        error_propagate(errp, err);
> +        goto error;
> +    }
> +
>      vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
>      vbasedev->dev = DEVICE(vdev);
>      vbasedev->fd = -1;
> @@ -3404,6 +3410,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>      vbasedev->no_mmap = false;
>      vbasedev->ops = &vfio_user_pci_ops;
>  
> +error:

Missing return before error label? We shouldn't disconnect in the
success case.

> +    vfio_user_disconnect(proxy);
> +    error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>  }
>  
>  static void vfio_user_instance_finalize(Object *obj)
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> index 2fcc77d997..e89464a571 100644
> --- a/hw/vfio/user.c
> +++ b/hw/vfio/user.c
> @@ -23,9 +23,16 @@
>  #include "io/channel-socket.h"
>  #include "io/channel-util.h"
>  #include "sysemu/iothread.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qapi/qmp/qjson.h"
> +#include "qapi/qmp/qnull.h"
> +#include "qapi/qmp/qstring.h"
> +#include "qapi/qmp/qnum.h"
>  #include "user.h"
>  
>  static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
> +static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
> +static int wait_time = 1000;   /* wait 1 sec for replies */
>  static IOThread *vfio_user_iothread;
>  
>  static void vfio_user_shutdown(VFIOProxy *proxy);
> @@ -34,7 +41,14 @@ static void vfio_user_send_locked(VFIOProxy *proxy, VFIOUserHdr *msg,
>                                    VFIOUserFDs *fds);
>  static void vfio_user_send(VFIOProxy *proxy, VFIOUserHdr *msg,
>                             VFIOUserFDs *fds);
> +static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
> +                                  uint32_t size, uint32_t flags);
> +static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
> +                                VFIOUserFDs *fds, int rsize, int flags);
>  
> +/* vfio_user_send_recv flags */
> +#define NOWAIT          0x1  /* do not wait for reply */
> +#define NOIOLOCK        0x2  /* do not drop iolock */

Please use "BQL", it's a widely used term while "iolock" isn't used:
s/IOLOCK/BQL/

>  
>  /*
>   * Functions called by main, CPU, or iothread threads
> @@ -333,6 +347,79 @@ static void vfio_user_cb(void *opaque)
>   * Functions called by main or CPU threads
>   */
>  
> +static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
> +                                VFIOUserFDs *fds, int rsize, int flags)
> +{
> +    VFIOUserReply *reply;
> +    bool iolock = 0;
> +
> +    if (msg->flags & VFIO_USER_NO_REPLY) {
> +        error_printf("vfio_user_send_recv on async message\n");
> +        return;
> +    }
> +
> +    /*
> +     * We may block later, so use a per-proxy lock and let
> +     * the iothreads run while we sleep unless told no to

s/no/not/

> +int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
> +{
> +    g_autofree VFIOUserVersion *msgp;
> +    GString *caps;
> +    int size, caplen;
> +
> +    caps = caps_json();
> +    caplen = caps->len + 1;
> +    size = sizeof(*msgp) + caplen;
> +    msgp = g_malloc0(size);
> +
> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_VERSION, size, 0);
> +    msgp->major = VFIO_USER_MAJOR_VER;
> +    msgp->minor = VFIO_USER_MINOR_VER;
> +    memcpy(&msgp->capabilities, caps->str, caplen);
> +    g_string_free(caps, true);
> +
> +    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0, 0);
> +    if (msgp->hdr.flags & VFIO_USER_ERROR) {
> +        error_setg_errno(errp, msgp->hdr.error_reply, "version reply");
> +        return -1;
> +    }
> +
> +    if (msgp->major != VFIO_USER_MAJOR_VER ||
> +        msgp->minor > VFIO_USER_MINOR_VER) {
> +        error_setg(errp, "incompatible server version");
> +        return -1;
> +    }
> +    if (caps_check(msgp->minor, (char *)msgp + sizeof(*msgp), errp) != 0) {

The reply is untrusted so we cannot treat it as a NUL-terminated string
yet. The final byte msgp->capabilities[] needs to be checked first.

Please be careful about input validation, I might miss something so it's
best if you audit the patches too. QEMU must not trust the device
emulation process and vice versa.

> +        return -1;
> +    }
> +
> +    return 0;
> +}
> -- 
> 2.25.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 07/16] vfio-user: get device info
  2021-08-16 16:42 ` [PATCH RFC v2 07/16] vfio-user: get device info Elena Ufimtseva
@ 2021-08-24 16:04   ` Stefan Hajnoczi
  2021-08-30  3:11     ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 16:04 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 881 bytes --]

On Mon, Aug 16, 2021 at 09:42:40AM -0700, Elena Ufimtseva wrote:
> +int vfio_user_get_info(VFIODevice *vbasedev)
> +{
> +    VFIOUserDeviceInfo msg;
> +
> +    memset(&msg, 0, sizeof(msg));
> +    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_INFO, sizeof(msg), 0);
> +    msg.argsz = sizeof(struct vfio_device_info);
> +
> +    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
> +    if (msg.hdr.flags & VFIO_USER_ERROR) {
> +        return -msg.hdr.error_reply;
> +    }
> +
> +    vbasedev->num_irqs = msg.num_irqs;
> +    vbasedev->num_regions = msg.num_regions;
> +    vbasedev->flags = msg.flags;
> +    vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);

No input validation. I haven't checked what happens when num_irqs,
num_regions, or flags are bogus but it's a little concerning. Unlike
kernel VFIO, we do not trust these values.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 00/11] vfio-user server in QEMU
  2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
                   ` (15 preceding siblings ...)
  2021-08-16 16:42 ` [PATCH RFC v2 16/16] vfio-user: migration support Elena Ufimtseva
@ 2021-08-27 17:53 ` Jagannathan Raman
  2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
                     ` (12 more replies)
  16 siblings, 13 replies; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Hi,

This series depends on the following series from
Elena Ufimtseva <elena.ufimtseva@oracle.com>:
[PATCH RFC v2 00/16] vfio-user implementation

Thank you for your feedback for the v1 patches!
https://www.mail-archive.com/qemu-devel@nongnu.org/msg825021.html

We have incorporated the following feedback from v1 of the
review cycle:

[PATCH RFC server v2 01/11] vfio-user: build library
  - Using cmake subproject to build libvfio-user

[PATCH RFC server v2 02/11] vfio-user: define vfio-user object
  - Added check to confirm that TYPE_REMOTE_MACHINE is used
    with TYPE_VFU_OBJECT

[PATCH RFC server v2 04/11] vfio-user: find and init PCI device
  - Removed call to vfu_pci_set_id()
  - Added check to confirm that TYPE_PCI_DEVICE is used with
    TYPE_VFU_OBJECT

[PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  - Using QEMU main-loop to drive the vfu_ctx (using
    vfu_get_poll_fd() & qemu_set_fd_handler())
  - Set vfu_ctx to non-blocking mode (LIBVFIO_USER_FLAG_ATTACH_NB)
  - Modified how QEMU attaches to the vfu_ctx

[PATCH RFC server v2 06/11] handle PCI config space accesses
  - Broke-up PCI config space access to 4-byte accesses

[PATCH RFC server v2 07/11] vfio-user: handle DMA mappings
  - Received feedback to assert that vfu_dma_info_t->vaddr is not
    NULL - unable to do it as it appears to be a valid case.

[PATCH RFC server v2 10/11] register handlers to facilitate migration
  - Migrate only one device's data per contect

Would appreciate if you could kindly review this v2 series. Looking
forward to your comments.

Thank you!

Jagannathan Raman (11):
  vfio-user: build library
  vfio-user: define vfio-user object
  vfio-user: instantiate vfio-user context
  vfio-user: find and init PCI device
  vfio-user: run vfio-user context
  vfio-user: handle PCI config space accesses
  vfio-user: handle DMA mappings
  vfio-user: handle PCI BAR accesses
  vfio-user: handle device interrupts
  vfio-user: register handlers to facilitate migration
  vfio-user: acceptance test

 configure                     |  11 +
 meson.build                   |  28 ++
 qapi/qom.json                 |  20 +-
 include/hw/remote/iohub.h     |   2 +
 migration/savevm.h            |   2 +
 hw/remote/iohub.c             |   5 +
 hw/remote/vfio-user-obj.c     | 803 ++++++++++++++++++++++++++++++++++++++++++
 migration/savevm.c            |  73 ++++
 .gitmodules                   |   3 +
 MAINTAINERS                   |   9 +
 hw/remote/meson.build         |   3 +
 hw/remote/trace-events        |  10 +
 subprojects/libvfio-user      |   1 +
 tests/acceptance/vfio-user.py |  94 +++++
 14 files changed, 1062 insertions(+), 2 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c
 create mode 160000 subprojects/libvfio-user
 create mode 100644 tests/acceptance/vfio-user.py

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 01/11] vfio-user: build library
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-08-27 18:05     ` Jag Raman
                       ` (2 more replies)
  2021-08-27 17:53   ` [PATCH RFC server v2 02/11] vfio-user: define vfio-user object Jagannathan Raman
                     ` (11 subsequent siblings)
  12 siblings, 3 replies; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

add the libvfio-user library as a submodule. build it as a cmake
subproject.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 configure                | 11 +++++++++++
 meson.build              | 28 ++++++++++++++++++++++++++++
 .gitmodules              |  3 +++
 MAINTAINERS              |  7 +++++++
 hw/remote/meson.build    |  2 ++
 subprojects/libvfio-user |  1 +
 6 files changed, 52 insertions(+)
 create mode 160000 subprojects/libvfio-user

diff --git a/configure b/configure
index 9a79a00..794e900 100755
--- a/configure
+++ b/configure
@@ -4291,6 +4291,17 @@ but not implemented on your system"
 fi
 
 ##########################################
+# check for multiprocess
+
+case "$multiprocess" in
+  auto | enabled )
+    if test "$git_submodules_action" != "ignore"; then
+      git_submodules="${git_submodules} libvfio-user"
+    fi
+    ;;
+esac
+
+##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
 
diff --git a/meson.build b/meson.build
index bf63784..2b2d5c2 100644
--- a/meson.build
+++ b/meson.build
@@ -1898,6 +1898,34 @@ if get_option('cfi') and slirp_opt == 'system'
          + ' Please configure with --enable-slirp=git')
 endif
 
+vfiouser = not_found
+if have_system and multiprocess_allowed
+  have_internal = fs.exists(meson.current_source_dir() / 'subprojects/libvfio-user/Makefile')
+
+  if not have_internal
+    error('libvfio-user source not found - please pull git submodule')
+  endif
+
+  json_c = dependency('json-c', required: false)
+    if not json_c.found()
+      json_c = dependency('libjson-c')
+  endif
+
+  cmake = import('cmake')
+
+  vfiouser_subproj = cmake.subproject('libvfio-user')
+
+  vfiouser_sl = vfiouser_subproj.dependency('vfio-user-static')
+
+  # Although cmake links the json-c library with vfio-user-static
+  # target, that info is not available to meson via cmake.subproject.
+  # As such, we have to separately declare the json-c dependency here.
+  # This appears to be a current limitation of using cmake inside meson.
+  # libvfio-user is planning a switch to meson in the future, which
+  # would address this item automatically.
+  vfiouser = declare_dependency(dependencies: [vfiouser_sl, json_c])
+endif
+
 fdt = not_found
 fdt_opt = get_option('fdt')
 if have_system
diff --git a/.gitmodules b/.gitmodules
index 08b1b48..cfeea7c 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -64,3 +64,6 @@
 [submodule "roms/vbootrom"]
 	path = roms/vbootrom
 	url = https://gitlab.com/qemu-project/vbootrom.git
+[submodule "subprojects/libvfio-user"]
+	path = subprojects/libvfio-user
+	url = https://github.com/nutanix/libvfio-user.git
diff --git a/MAINTAINERS b/MAINTAINERS
index 4039d3c..0c5a18e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3361,6 +3361,13 @@ F: semihosting/
 F: include/semihosting/
 F: tests/tcg/multiarch/arm-compat-semi/
 
+libvfio-user Library
+M: Thanos Makatos <thanos.makatos@nutanix.com>
+M: John Levon <john.levon@nutanix.com>
+T: https://github.com/nutanix/libvfio-user.git
+S: Maintained
+F: subprojects/libvfio-user/*
+
 Multi-process QEMU
 M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
 M: Jagannathan Raman <jag.raman@oracle.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index e6a5574..fb35fb8 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -7,6 +7,8 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
 
+remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: vfiouser)
+
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('memory.c'))
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy-memory-listener.c'))
 
diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
new file mode 160000
index 0000000..647c934
--- /dev/null
+++ b/subprojects/libvfio-user
@@ -0,0 +1 @@
+Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 02/11] vfio-user: define vfio-user object
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
  2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-08 12:37     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
                     ` (10 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Define vfio-user object which is remote process server for QEMU. Setup
object initialization functions and properties necessary to instantiate
the object

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 qapi/qom.json             |  20 ++++++-
 hw/remote/vfio-user-obj.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS               |   1 +
 hw/remote/meson.build     |   1 +
 hw/remote/trace-events    |   3 +
 5 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c

diff --git a/qapi/qom.json b/qapi/qom.json
index a25616b..3e941ee 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -689,6 +689,20 @@
   'data': { 'fd': 'str', 'devid': 'str' } }
 
 ##
+# @VfioUserProperties:
+#
+# Properties for vfio-user objects.
+#
+# @socket: path to be used as socket by the libvfiouser library
+#
+# @devid: the id of the device to be associated with the file descriptor
+#
+# Since: 6.0
+##
+{ 'struct': 'VfioUserProperties',
+  'data': { 'socket': 'str', 'devid': 'str' } }
+
+##
 # @RngProperties:
 #
 # Properties for objects of classes derived from rng.
@@ -812,7 +826,8 @@
     'tls-creds-psk',
     'tls-creds-x509',
     'tls-cipher-suites',
-    'x-remote-object'
+    'x-remote-object',
+    'vfio-user'
   ] }
 
 ##
@@ -868,7 +883,8 @@
       'tls-creds-psk':              'TlsCredsPskProperties',
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
-      'x-remote-object':            'RemoteObjectProperties'
+      'x-remote-object':            'RemoteObjectProperties',
+      'vfio-user':                  'VfioUserProperties'
   } }
 
 ##
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
new file mode 100644
index 0000000..4a1e297
--- /dev/null
+++ b/hw/remote/vfio-user-obj.c
@@ -0,0 +1,145 @@
+/**
+ * QEMU vfio-user server object
+ *
+ * Copyright © 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
+ *
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/**
+ * Usage: add options:
+ *     -machine x-remote
+ *     -device <PCI-device>,id=<pci-dev-id>
+ *     -object vfio-user,id=<id>,socket=<socket-path>,devid=<pci-dev-id>
+ *
+ * Note that vfio-user object must be used with x-remote machine only. This
+ * server could only support PCI devices for now.
+ *
+ * socket is path to a file. This file will be created by the server. It is
+ * a required option
+ *
+ * devid is the id of a PCI device on the server. It is also a required option.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qom/object.h"
+#include "qom/object_interfaces.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "sysemu/runstate.h"
+
+#define TYPE_VFU_OBJECT "vfio-user"
+OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
+
+struct VfuObjectClass {
+    ObjectClass parent_class;
+
+    unsigned int nr_devs;
+
+    /* Maximum number of devices the server could support */
+    unsigned int max_devs;
+};
+
+struct VfuObject {
+    /* private */
+    Object parent;
+
+    char *socket;
+    char *devid;
+};
+
+static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    g_free(o->socket);
+
+    o->socket = g_strdup(str);
+
+    trace_vfu_prop("socket", str);
+}
+
+static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    g_free(o->devid);
+
+    o->devid = g_strdup(str);
+
+    trace_vfu_prop("devid", str);
+}
+
+static void vfu_object_init(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+
+    if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
+        error_report("vfu: %s only compatible with %s machine",
+                     TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
+        return;
+    }
+
+    if (k->nr_devs >= k->max_devs) {
+        error_report("Reached maximum number of vfio-user devices: %u",
+                     k->max_devs);
+        return;
+    }
+
+    k->nr_devs++;
+}
+
+static void vfu_object_finalize(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
+
+    k->nr_devs--;
+
+    g_free(o->socket);
+    g_free(o->devid);
+
+    if (k->nr_devs == 0) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+}
+
+static void vfu_object_class_init(ObjectClass *klass, void *data)
+{
+    VfuObjectClass *k = VFU_OBJECT_CLASS(klass);
+
+    /* Limiting maximum number of devices to 1 until IOMMU support is added */
+    k->max_devs = 1;
+    k->nr_devs = 0;
+
+    object_class_property_add_str(klass, "socket", NULL,
+                                  vfu_object_set_socket);
+    object_class_property_add_str(klass, "devid", NULL,
+                                  vfu_object_set_devid);
+}
+
+static const TypeInfo vfu_object_info = {
+    .name = TYPE_VFU_OBJECT,
+    .parent = TYPE_OBJECT,
+    .instance_size = sizeof(VfuObject),
+    .instance_init = vfu_object_init,
+    .instance_finalize = vfu_object_finalize,
+    .class_size = sizeof(VfuObjectClass),
+    .class_init = vfu_object_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_USER_CREATABLE },
+        { }
+    }
+};
+
+static void vfu_register_types(void)
+{
+    type_register_static(&vfu_object_info);
+}
+
+type_init(vfu_register_types);
diff --git a/MAINTAINERS b/MAINTAINERS
index 0c5a18e..f9d8092 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3391,6 +3391,7 @@ F: hw/remote/proxy-memory-listener.c
 F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
+F: hw/remote/vfio-user-obj.c
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index fb35fb8..cd44dfc 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
+remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('vfio-user-obj.c'))
 
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: vfiouser)
 
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0b23974..7da12f0 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -2,3 +2,6 @@
 
 mpqemu_send_io_error(int cmd, int size, int nfds) "send command %d size %d, %d file descriptors to remote process"
 mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d, %d file descriptors to remote process"
+
+# vfio-user-obj.c
+vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
  2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
  2021-08-27 17:53   ` [PATCH RFC server v2 02/11] vfio-user: define vfio-user object Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-08 12:40     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 04/11] vfio-user: find and init PCI device Jagannathan Raman
                     ` (9 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

create a context with the vfio-user library to run a PCI device

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 4a1e297..99d3dd1 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -27,11 +27,17 @@
 #include "qemu/osdep.h"
 #include "qemu-common.h"
 
+#include <errno.h>
+
 #include "qom/object.h"
 #include "qom/object_interfaces.h"
 #include "qemu/error-report.h"
 #include "trace.h"
 #include "sysemu/runstate.h"
+#include "qemu/notify.h"
+#include "qapi/error.h"
+#include "sysemu/sysemu.h"
+#include "libvfio-user.h"
 
 #define TYPE_VFU_OBJECT "vfio-user"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -51,6 +57,10 @@ struct VfuObject {
 
     char *socket;
     char *devid;
+
+    Notifier machine_done;
+
+    vfu_ctx_t *vfu_ctx;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -75,9 +85,23 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
     trace_vfu_prop("devid", str);
 }
 
+static void vfu_object_machine_done(Notifier *notifier, void *data)
+{
+    VfuObject *o = container_of(notifier, VfuObject, machine_done);
+
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
+                                o, VFU_DEV_TYPE_PCI);
+    if (o->vfu_ctx == NULL) {
+        error_setg(&error_abort, "vfu: Failed to create context - %s",
+                   strerror(errno));
+        return;
+    }
+}
+
 static void vfu_object_init(Object *obj)
 {
     VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
 
     if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
         error_report("vfu: %s only compatible with %s machine",
@@ -92,6 +116,9 @@ static void vfu_object_init(Object *obj)
     }
 
     k->nr_devs++;
+
+    o->machine_done.notify = vfu_object_machine_done;
+    qemu_add_machine_init_done_notifier(&o->machine_done);
 }
 
 static void vfu_object_finalize(Object *obj)
@@ -101,6 +128,8 @@ static void vfu_object_finalize(Object *obj)
 
     k->nr_devs--;
 
+    vfu_destroy_ctx(o->vfu_ctx);
+
     g_free(o->socket);
     g_free(o->devid);
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 04/11] vfio-user: find and init PCI device
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (2 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-08 12:43     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 05/11] vfio-user: run vfio-user context Jagannathan Raman
                     ` (8 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Find the PCI device with specified id. Initialize the device context
with the QEMU PCI device

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 99d3dd1..5ae0991 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -38,6 +38,8 @@
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
+#include "hw/qdev-core.h"
+#include "hw/pci/pci.h"
 
 #define TYPE_VFU_OBJECT "vfio-user"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -61,6 +63,8 @@ struct VfuObject {
     Notifier machine_done;
 
     vfu_ctx_t *vfu_ctx;
+
+    PCIDevice *pci_dev;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -88,6 +92,8 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
+    DeviceState *dev = NULL;
+    int ret;
 
     o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
                                 o, VFU_DEV_TYPE_PCI);
@@ -96,6 +102,28 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
                    strerror(errno));
         return;
     }
+
+    dev = qdev_find_recursive(sysbus_get_default(), o->devid);
+    if (dev == NULL) {
+        error_setg(&error_abort, "vfu: Device %s not found", o->devid);
+        return;
+    }
+
+    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+        error_setg(&error_abort, "vfu: %s not a PCI devices", o->devid);
+        return;
+    }
+
+    o->pci_dev = PCI_DEVICE(dev);
+
+    ret = vfu_pci_init(o->vfu_ctx, VFU_PCI_TYPE_CONVENTIONAL,
+                       PCI_HEADER_TYPE_NORMAL, 0);
+    if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to attach PCI device %s to context - %s",
+                   o->devid, strerror(errno));
+        return;
+    }
 }
 
 static void vfu_object_init(Object *obj)
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (3 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 04/11] vfio-user: find and init PCI device Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-08 12:58     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
                     ` (7 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Setup a handler to run vfio-user context. The context is driven by
messages to the file descriptor associated with it - get the fd for
the context and hook up the handler with it

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 5ae0991..0726eb9 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -35,6 +35,7 @@
 #include "trace.h"
 #include "sysemu/runstate.h"
 #include "qemu/notify.h"
+#include "qemu/thread.h"
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
@@ -65,6 +66,8 @@ struct VfuObject {
     vfu_ctx_t *vfu_ctx;
 
     PCIDevice *pci_dev;
+
+    int vfu_poll_fd;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -89,13 +92,67 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
     trace_vfu_prop("devid", str);
 }
 
+static void vfu_object_ctx_run(void *opaque)
+{
+    VfuObject *o = opaque;
+    int ret = -1;
+
+    while (ret != 0) {
+        ret = vfu_run_ctx(o->vfu_ctx);
+        if (ret < 0) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == ENOTCONN) {
+                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
+                o->vfu_poll_fd = -1;
+                object_unparent(OBJECT(o));
+                break;
+            } else {
+                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
+                           o->devid, strerror(errno));
+                 break;
+            }
+        }
+    }
+}
+
+static void *vfu_object_attach_ctx(void *opaque)
+{
+    VfuObject *o = opaque;
+    int ret;
+
+retry_attach:
+    ret = vfu_attach_ctx(o->vfu_ctx);
+    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
+        goto retry_attach;
+    } else if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to attach device %s to context - %s",
+                   o->devid, strerror(errno));
+        return NULL;
+    }
+
+    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
+    if (o->vfu_poll_fd < 0) {
+        error_setg(&error_abort, "vfu: Failed to get poll fd %s", o->devid);
+        return NULL;
+    }
+
+    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run,
+                        NULL, o);
+
+    return NULL;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
     DeviceState *dev = NULL;
+    QemuThread thread;
     int ret;
 
-    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket,
+                                LIBVFIO_USER_FLAG_ATTACH_NB,
                                 o, VFU_DEV_TYPE_PCI);
     if (o->vfu_ctx == NULL) {
         error_setg(&error_abort, "vfu: Failed to create context - %s",
@@ -124,6 +181,16 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
                    o->devid, strerror(errno));
         return;
     }
+
+    ret = vfu_realize_ctx(o->vfu_ctx);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
+    qemu_thread_create(&thread, o->socket, vfu_object_attach_ctx, o,
+                       QEMU_THREAD_DETACHED);
 }
 
 static void vfu_object_init(Object *obj)
@@ -147,6 +214,8 @@ static void vfu_object_init(Object *obj)
 
     o->machine_done.notify = vfu_object_machine_done;
     qemu_add_machine_init_done_notifier(&o->machine_done);
+
+    o->vfu_poll_fd = -1;
 }
 
 static void vfu_object_finalize(Object *obj)
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (4 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 05/11] vfio-user: run vfio-user context Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-09  7:27     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 07/11] vfio-user: handle DMA mappings Jagannathan Raman
                     ` (6 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Define and register handlers for PCI config space accesses

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 0726eb9..13011ce 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -36,6 +36,7 @@
 #include "sysemu/runstate.h"
 #include "qemu/notify.h"
 #include "qemu/thread.h"
+#include "qemu/main-loop.h"
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
@@ -144,6 +145,38 @@ retry_attach:
     return NULL;
 }
 
+static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
+                                     size_t count, loff_t offset,
+                                     const bool is_write)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint32_t pci_access_width = sizeof(uint32_t);
+    size_t bytes = count;
+    uint32_t val = 0;
+    char *ptr = buf;
+    int len;
+
+    while (bytes > 0) {
+        len = (bytes > pci_access_width) ? pci_access_width : bytes;
+        if (is_write) {
+            memcpy(&val, ptr, len);
+            pci_default_write_config(PCI_DEVICE(o->pci_dev),
+                                     offset, val, len);
+            trace_vfu_cfg_write(offset, val);
+        } else {
+            val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
+                                          offset, len);
+            memcpy(ptr, &val, len);
+            trace_vfu_cfg_read(offset, val);
+        }
+        offset += len;
+        ptr += len;
+        bytes -= len;
+    }
+
+    return count;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -182,6 +215,17 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
+                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
+                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
+                           NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to setup config space handlers for %s- %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 7da12f0..2ef7884 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -5,3 +5,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 
 # vfio-user-obj.c
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
+vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
+vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 07/11] vfio-user: handle DMA mappings
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (5 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-09  7:29     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
                     ` (5 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Define and register callbacks to manage the RAM regions used for
device DMA

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 52 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 13011ce..76fb2d4 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -177,6 +177,49 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     return count;
 }
 
+static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    MemoryRegion *subregion = NULL;
+    g_autofree char *name = NULL;
+    static unsigned int suffix;
+    struct iovec *iov = &info->iova;
+
+    if (!info->vaddr) {
+        return;
+    }
+
+    name = g_strdup_printf("remote-mem-%u", suffix++);
+
+    subregion = g_new0(MemoryRegion, 1);
+
+    memory_region_init_ram_ptr(subregion, NULL, name,
+                               iov->iov_len, info->vaddr);
+
+    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
+                                subregion);
+
+    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
+}
+
+static int dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    MemoryRegion *mr = NULL;
+    ram_addr_t offset;
+
+    mr = memory_region_from_host(info->vaddr, &offset);
+    if (!mr) {
+        return 0;
+    }
+
+    memory_region_del_subregion(get_system_memory(), mr);
+
+    object_unparent((OBJECT(mr)));
+
+    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
+
+    return 0;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -226,6 +269,13 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    ret = vfu_setup_device_dma(o->vfu_ctx, &dma_register, &dma_unregister);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to setup DMA handlers for %s",
+                   o->devid);
+        return;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 2ef7884..f945c7e 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -7,3 +7,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
 vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
+vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
+vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (6 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 07/11] vfio-user: handle DMA mappings Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-09  7:37     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 09/11] vfio-user: handle device interrupts Jagannathan Raman
                     ` (4 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Determine the BARs used by the PCI device and register handlers to
manage the access to the same.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 +
 2 files changed, 97 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 76fb2d4..299c938 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -220,6 +220,99 @@ static int dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     return 0;
 }
 
+static ssize_t vfu_object_bar_rw(PCIDevice *pci_dev, hwaddr addr, size_t count,
+                                 char * const buf, const bool is_write,
+                                 uint8_t type)
+{
+    AddressSpace *as = NULL;
+    MemTxResult res;
+
+    if (type == PCI_BASE_ADDRESS_SPACE_MEMORY) {
+        as = pci_device_iommu_address_space(pci_dev);
+    } else {
+        as = &address_space_io;
+    }
+
+    trace_vfu_bar_rw_enter(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    res = address_space_rw(as, addr, MEMTXATTRS_UNSPECIFIED, (void *)buf,
+                           (hwaddr)count, is_write);
+    if (res != MEMTX_OK) {
+        warn_report("vfu: failed to %s 0x%"PRIx64"",
+                    is_write ? "write to" : "read from",
+                    addr);
+        return -1;
+    }
+
+    trace_vfu_bar_rw_exit(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    return count;
+}
+
+/**
+ * VFU_OBJECT_BAR_HANDLER - macro for defining handlers for PCI BARs.
+ *
+ * To create handler for BAR number 2, VFU_OBJECT_BAR_HANDLER(2) would
+ * define vfu_object_bar2_handler
+ */
+#define VFU_OBJECT_BAR_HANDLER(BAR_NO)                                         \
+    static ssize_t vfu_object_bar##BAR_NO##_handler(vfu_ctx_t *vfu_ctx,        \
+                                        char * const buf, size_t count,        \
+                                        loff_t offset, const bool is_write)    \
+    {                                                                          \
+        VfuObject *o = vfu_get_private(vfu_ctx);                               \
+        hwaddr addr = (hwaddr)(pci_get_long(o->pci_dev->config +               \
+                                            PCI_BASE_ADDRESS_0 +               \
+                                            (4 * BAR_NO)) + offset);           \
+                                                                               \
+        return vfu_object_bar_rw(o->pci_dev, addr, count, buf, is_write,       \
+                                 o->pci_dev->io_regions[BAR_NO].type);         \
+    }                                                                          \
+
+VFU_OBJECT_BAR_HANDLER(0)
+VFU_OBJECT_BAR_HANDLER(1)
+VFU_OBJECT_BAR_HANDLER(2)
+VFU_OBJECT_BAR_HANDLER(3)
+VFU_OBJECT_BAR_HANDLER(4)
+VFU_OBJECT_BAR_HANDLER(5)
+
+static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
+    &vfu_object_bar0_handler,
+    &vfu_object_bar1_handler,
+    &vfu_object_bar2_handler,
+    &vfu_object_bar3_handler,
+    &vfu_object_bar4_handler,
+    &vfu_object_bar5_handler,
+};
+
+/**
+ * vfu_object_register_bars - Identify active BAR regions of pdev and setup
+ *                            callbacks to handle read/write accesses
+ */
+static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
+{
+    uint32_t orig_val, new_val;
+    int i, size;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        orig_val = pci_default_read_config(pdev,
+                                           PCI_BASE_ADDRESS_0 + (4 * i), 4);
+        new_val = 0xffffffff;
+        pci_default_write_config(pdev,
+                                 PCI_BASE_ADDRESS_0 + (4 * i), new_val, 4);
+        new_val = pci_default_read_config(pdev,
+                                          PCI_BASE_ADDRESS_0 + (4 * i), 4);
+        size = (~(new_val & 0xFFFFFFF0)) + 1;
+        pci_default_write_config(pdev, PCI_BASE_ADDRESS_0 + (4 * i),
+                                 orig_val, 4);
+        if (size) {
+            vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR0_REGION_IDX + i, size,
+                             vfu_object_bar_handlers[i], VFU_REGION_FLAG_RW,
+                             NULL, 0, -1, 0);
+        }
+    }
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -276,6 +369,8 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index f945c7e..f3f65e2 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,3 +9,5 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
+vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 09/11] vfio-user: handle device interrupts
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (7 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-09  7:40     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
                     ` (3 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Forward remote device's interrupts to the guest

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/remote/iohub.h |  2 ++
 hw/remote/iohub.c         |  5 +++++
 hw/remote/vfio-user-obj.c | 30 ++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  1 +
 4 files changed, 38 insertions(+)

diff --git a/include/hw/remote/iohub.h b/include/hw/remote/iohub.h
index 0bf98e0..d5bd0b0 100644
--- a/include/hw/remote/iohub.h
+++ b/include/hw/remote/iohub.h
@@ -15,6 +15,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/thread-posix.h"
 #include "hw/remote/mpqemu-link.h"
+#include "libvfio-user.h"
 
 #define REMOTE_IOHUB_NB_PIRQS    PCI_DEVFN_MAX
 
@@ -30,6 +31,7 @@ typedef struct RemoteIOHubState {
     unsigned int irq_level[REMOTE_IOHUB_NB_PIRQS];
     ResampleToken token[REMOTE_IOHUB_NB_PIRQS];
     QemuMutex irq_level_lock[REMOTE_IOHUB_NB_PIRQS];
+    vfu_ctx_t *vfu_ctx[REMOTE_IOHUB_NB_PIRQS];
 } RemoteIOHubState;
 
 int remote_iohub_map_irq(PCIDevice *pci_dev, int intx);
diff --git a/hw/remote/iohub.c b/hw/remote/iohub.c
index 547d597..9410233 100644
--- a/hw/remote/iohub.c
+++ b/hw/remote/iohub.c
@@ -18,6 +18,7 @@
 #include "hw/remote/machine.h"
 #include "hw/remote/iohub.h"
 #include "qemu/main-loop.h"
+#include "trace.h"
 
 void remote_iohub_init(RemoteIOHubState *iohub)
 {
@@ -62,6 +63,10 @@ void remote_iohub_set_irq(void *opaque, int pirq, int level)
     QEMU_LOCK_GUARD(&iohub->irq_level_lock[pirq]);
 
     if (level) {
+        if (iohub->vfu_ctx[pirq]) {
+            trace_vfu_interrupt(pirq);
+            vfu_irq_trigger(iohub->vfu_ctx[pirq], 0);
+        }
         if (++iohub->irq_level[pirq] == 1) {
             event_notifier_set(&iohub->irqfds[pirq]);
         }
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 299c938..92605ed 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -42,6 +42,9 @@
 #include "libvfio-user.h"
 #include "hw/qdev-core.h"
 #include "hw/pci/pci.h"
+#include "hw/boards.h"
+#include "hw/remote/iohub.h"
+#include "hw/remote/machine.h"
 
 #define TYPE_VFU_OBJECT "vfio-user"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -313,6 +316,26 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
     }
 }
 
+static int vfu_object_setup_irqs(vfu_ctx_t *vfu_ctx, PCIDevice *pci_dev)
+{
+    RemoteMachineState *machine = REMOTE_MACHINE(current_machine);
+    RemoteIOHubState *iohub = &machine->iohub;
+    int pirq, intx, ret;
+
+    ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1);
+    if (ret < 0) {
+        return ret;
+    }
+
+    intx = pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
+
+    pirq = remote_iohub_map_irq(pci_dev, intx);
+
+    iohub->vfu_ctx[pirq] = vfu_ctx;
+
+    return 0;
+}
+
 static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
@@ -371,6 +394,13 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
 
     vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
 
+    ret = vfu_object_setup_irqs(o->vfu_ctx, o->pci_dev);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to setup interrupts for %s",
+                   o->devid);
+        return;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index f3f65e2..b419d6f 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -11,3 +11,4 @@ vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %z
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
+vfu_interrupt(int pirq) "vfu: sending interrupt to device - PIRQ %d"
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 10/11] vfio-user: register handlers to facilitate migration
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (8 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 09/11] vfio-user: handle device interrupts Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-09  8:14     ` Stefan Hajnoczi
  2021-08-27 17:53   ` [PATCH RFC server v2 11/11] vfio-user: acceptance test Jagannathan Raman
                     ` (2 subsequent siblings)
  12 siblings, 1 reply; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Store and load the device's state during migration. use libvfio-user's
handlers for this purpose

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 migration/savevm.h        |   2 +
 hw/remote/vfio-user-obj.c | 313 ++++++++++++++++++++++++++++++++++++++++++++++
 migration/savevm.c        |  73 +++++++++++
 3 files changed, 388 insertions(+)

diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342..8007064 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -67,5 +67,7 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
+int qemu_remote_savevm(QEMUFile *f, DeviceState *dev);
+int qemu_remote_loadvm(QEMUFile *f);
 
 #endif
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 92605ed..16cf515 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -45,6 +45,10 @@
 #include "hw/boards.h"
 #include "hw/remote/iohub.h"
 #include "hw/remote/machine.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/global_state.h"
+#include "block/block.h"
 
 #define TYPE_VFU_OBJECT "vfio-user"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -72,6 +76,33 @@ struct VfuObject {
     PCIDevice *pci_dev;
 
     int vfu_poll_fd;
+
+    /*
+     * vfu_mig_buf holds the migration data. In the remote server, this
+     * buffer replaces the role of an IO channel which links the source
+     * and the destination.
+     *
+     * Whenever the client QEMU process initiates migration, the remote
+     * server gets notified via libvfio-user callbacks. The remote server
+     * sets up a QEMUFile object using this buffer as backend. The remote
+     * server passes this object to its migration subsystem, which slurps
+     * the VMSD of the device ('devid' above) referenced by this object
+     * and stores the VMSD in this buffer.
+     *
+     * The client subsequetly asks the remote server for any data that
+     * needs to be moved over to the destination via libvfio-user
+     * library's vfu_migration_callbacks_t callbacks. The remote hands
+     * over this buffer as data at this time.
+     *
+     * A reverse of this process happens at the destination.
+     */
+    uint8_t *vfu_mig_buf;
+
+    uint64_t vfu_mig_buf_size;
+
+    uint64_t vfu_mig_buf_pending;
+
+    QEMUFile *vfu_mig_file;
 };
 
 static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
@@ -96,6 +127,250 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
     trace_vfu_prop("devid", str);
 }
 
+/**
+ * Migration helper functions
+ *
+ * vfu_mig_buf_read & vfu_mig_buf_write are used by QEMU's migration
+ * subsystem - qemu_remote_loadvm & qemu_remote_savevm. loadvm/savevm
+ * call these functions via QEMUFileOps to load/save the VMSD of a
+ * device into vfu_mig_buf
+ *
+ */
+static ssize_t vfu_mig_buf_read(void *opaque, uint8_t *buf, int64_t pos,
+                                size_t size, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    if (pos > o->vfu_mig_buf_size) {
+        size = 0;
+    } else if ((pos + size) > o->vfu_mig_buf_size) {
+        size = o->vfu_mig_buf_size;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + pos), size);
+
+    o->vfu_mig_buf_size -= size;
+
+    return size;
+}
+
+static ssize_t vfu_mig_buf_write(void *opaque, struct iovec *iov, int iovcnt,
+                                 int64_t pos, Error **errp)
+{
+    VfuObject *o = opaque;
+    uint64_t end = pos + iov_size(iov, iovcnt);
+    int i;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+    }
+
+    for (i = 0; i < iovcnt; i++) {
+        memcpy((o->vfu_mig_buf + o->vfu_mig_buf_size), iov[i].iov_base,
+               iov[i].iov_len);
+        o->vfu_mig_buf_size += iov[i].iov_len;
+        o->vfu_mig_buf_pending += iov[i].iov_len;
+    }
+
+    return iov_size(iov, iovcnt);
+}
+
+static int vfu_mig_buf_shutdown(void *opaque, bool rd, bool wr, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    o->vfu_mig_buf_size = 0;
+
+    g_free(o->vfu_mig_buf);
+
+    return 0;
+}
+
+static const QEMUFileOps vfu_mig_fops_save = {
+    .writev_buffer  = vfu_mig_buf_write,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+static const QEMUFileOps vfu_mig_fops_load = {
+    .get_buffer     = vfu_mig_buf_read,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+/**
+ * handlers for vfu_migration_callbacks_t
+ *
+ * The libvfio-user library accesses these handlers to drive the migration
+ * at the remote end, and also to transport the data stored in vfu_mig_buf
+ *
+ */
+static void vfu_mig_state_precopy(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    int ret;
+
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_save, false);
+    }
+
+    ret = qemu_remote_savevm(o->vfu_mig_file, DEVICE(o->pci_dev));
+    if (ret) {
+        qemu_file_shutdown(o->vfu_mig_file);
+        return;
+    }
+
+    qemu_fflush(o->vfu_mig_file);
+}
+
+static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
+    static int migrated_devs;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = qemu_remote_loadvm(o->vfu_mig_file);
+    if (ret) {
+        error_setg(&error_abort, "vfu: failed to restore device state");
+        return;
+    }
+
+    if (++migrated_devs == k->nr_devs) {
+        bdrv_invalidate_cache_all(&local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            return;
+        }
+
+        vm_start();
+    }
+}
+
+static void vfu_mig_state_stop(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
+    static int migrated_devs;
+
+    /**
+     * note: calling bdrv_inactivate_all() is not the best approach.
+     *
+     *  Ideally, we would identify the block devices (if any) indirectly
+     *  linked (such as via a scs-hd device) to each of the migrated devices,
+     *  and inactivate them individually. This is essential while operating
+     *  the server in a storage daemon mode, with devices from different VMs.
+     *
+     *  However, we currently don't have this capability. As such, we need to
+     *  inactivate all devices at the same time when migration is completed.
+     */
+    if (++migrated_devs == k->nr_devs) {
+        bdrv_inactivate_all();
+    }
+}
+
+static int vfu_mig_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
+{
+    switch (state) {
+    case VFU_MIGR_STATE_RESUME:
+    case VFU_MIGR_STATE_STOP_AND_COPY:
+        break;
+    case VFU_MIGR_STATE_STOP:
+        vfu_mig_state_stop(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_PRE_COPY:
+        vfu_mig_state_precopy(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_RUNNING:
+        if (!runstate_is_running()) {
+            vfu_mig_state_running(vfu_ctx);
+        }
+        break;
+    default:
+        warn_report("vfu: Unknown migration state %d", state);
+    }
+
+    return 0;
+}
+
+static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    return o->vfu_mig_buf_pending;
+}
+
+static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
+                                uint64_t *size)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset) {
+        *offset = 0;
+    }
+
+    if (size) {
+        *size = o->vfu_mig_buf_size;
+    }
+
+    return 0;
+}
+
+static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
+                                 uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset > o->vfu_mig_buf_size) {
+        return -1;
+    }
+
+    if ((offset + size) > o->vfu_mig_buf_size) {
+        warn_report("vfu: buffer overflow - check pending_bytes");
+        size = o->vfu_mig_buf_size - offset;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + offset), size);
+
+    o->vfu_mig_buf_pending -= size;
+
+    return size;
+}
+
+static ssize_t vfu_mig_write_data(vfu_ctx_t *vfu_ctx, void *data,
+                                  uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint64_t end = offset + size;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+        o->vfu_mig_buf_size = end;
+    }
+
+    memcpy((o->vfu_mig_buf + offset), data, size);
+
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
+    }
+
+    return size;
+}
+
+static int vfu_mig_data_written(vfu_ctx_t *vfu_ctx, uint64_t count)
+{
+    return 0;
+}
+
+static const vfu_migration_callbacks_t vfu_mig_cbs = {
+    .version = VFU_MIGR_CALLBACKS_VERS,
+    .transition = &vfu_mig_transition,
+    .get_pending_bytes = &vfu_mig_get_pending_bytes,
+    .prepare_data = &vfu_mig_prepare_data,
+    .read_data = &vfu_mig_read_data,
+    .data_written = &vfu_mig_data_written,
+    .write_data = &vfu_mig_write_data,
+};
+
 static void vfu_object_ctx_run(void *opaque)
 {
     VfuObject *o = opaque;
@@ -340,6 +615,7 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
 {
     VfuObject *o = container_of(notifier, VfuObject, machine_done);
     DeviceState *dev = NULL;
+    size_t migr_area_size;
     QemuThread thread;
     int ret;
 
@@ -401,6 +677,35 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
         return;
     }
 
+    /*
+     * TODO: The 0x20000 number used below is a temporary. We are working on
+     *     a cleaner fix for this.
+     *
+     *     The libvfio-user library assumes that the remote knows the size of
+     *     the data to be migrated at boot time, but that is not the case with
+     *     VMSDs, as it can contain a variable-size buffer. 0x20000 is used
+     *     as a sufficiently large buffer to demonstrate migration, but that
+     *     cannot be used as a solution.
+     *
+     */
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX,
+                           0x20000, NULL,
+                           VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to register migration BAR %s- %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
+    migr_area_size = vfu_get_migr_register_area_size();
+    ret = vfu_setup_device_migration_callbacks(o->vfu_ctx, &vfu_mig_cbs,
+                                               migr_area_size);
+    if (ret < 0) {
+        error_setg(&error_abort, "vfu: Failed to setup migration %s- %s",
+                   o->devid, strerror(errno));
+        return;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
@@ -435,6 +740,14 @@ static void vfu_object_init(Object *obj)
     qemu_add_machine_init_done_notifier(&o->machine_done);
 
     o->vfu_poll_fd = -1;
+
+    o->vfu_mig_file = NULL;
+
+    o->vfu_mig_buf = NULL;
+
+    o->vfu_mig_buf_size = 0;
+
+    o->vfu_mig_buf_pending = 0;
 }
 
 static void vfu_object_finalize(Object *obj)
diff --git a/migration/savevm.c b/migration/savevm.c
index 7b7b64b..341fde7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1604,6 +1604,49 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     return ret;
 }
 
+static SaveStateEntry *find_se_from_dev(DeviceState *dev)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (se->opaque == dev) {
+            return se;
+        }
+    }
+
+    return NULL;
+}
+
+int qemu_remote_savevm(QEMUFile *f, DeviceState *dev)
+{
+    SaveStateEntry *se;
+    int ret = 0;
+
+    se = find_se_from_dev(dev);
+    if (!se) {
+        return -ENODEV;
+    }
+
+    if (!se->vmsd || !vmstate_save_needed(se->vmsd, se->opaque)) {
+        return ret;
+    }
+
+    save_section_header(f, se, QEMU_VM_SECTION_FULL);
+
+    ret = vmstate_save(f, se, NULL);
+    if (ret) {
+        qemu_file_set_error(f, ret);
+        return ret;
+    }
+
+    save_section_footer(f, se);
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+    qemu_fflush(f);
+
+    return 0;
+}
+
 void qemu_savevm_live_state(QEMUFile *f)
 {
     /* save QEMU_VM_SECTION_END section */
@@ -2444,6 +2487,36 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
     return 0;
 }
 
+int qemu_remote_loadvm(QEMUFile *f)
+{
+    uint8_t section_type;
+    int ret = 0;
+
+    while (true) {
+        section_type = qemu_get_byte(f);
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            break;
+        }
+
+        switch (section_type) {
+        case QEMU_VM_SECTION_FULL:
+            ret = qemu_loadvm_section_start_full(f, NULL);
+            if (ret < 0) {
+                break;
+            }
+            break;
+        case QEMU_VM_EOF:
+            return ret;
+        default:
+            return -EINVAL;
+        }
+    }
+
+    return ret;
+}
+
 static int
 qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
 {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC server v2 11/11] vfio-user: acceptance test
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (9 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2021-08-27 17:53   ` Jagannathan Raman
  2021-09-08 10:08   ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Stefan Hajnoczi
  2021-09-09  8:17   ` Stefan Hajnoczi
  12 siblings, 0 replies; 108+ messages in thread
From: Jagannathan Raman @ 2021-08-27 17:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, swapnil.ingle,
	john.levon, philmd, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

Acceptance test for libvfio-user in QEMU

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 MAINTAINERS                   |  1 +
 tests/acceptance/vfio-user.py | 94 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)
 create mode 100644 tests/acceptance/vfio-user.py

diff --git a/MAINTAINERS b/MAINTAINERS
index f9d8092..2c7332b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3392,6 +3392,7 @@ F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: hw/remote/vfio-user-obj.c
+F: tests/acceptance/vfio-user.py
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/tests/acceptance/vfio-user.py b/tests/acceptance/vfio-user.py
new file mode 100644
index 0000000..ef318d9
--- /dev/null
+++ b/tests/acceptance/vfio-user.py
@@ -0,0 +1,94 @@
+# vfio-user protocol sanity test
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+
+import os
+import socket
+import uuid
+
+from avocado_qemu import Test
+from avocado_qemu import wait_for_console_pattern
+from avocado_qemu import exec_command
+from avocado_qemu import exec_command_and_wait_for_pattern
+
+class VfioUser(Test):
+    """
+    :avocado: tags=vfiouser
+    """
+    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
+
+    def do_test(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        """Main test method"""
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+
+        socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(socket):
+            os.remove(socket)
+
+        # Create remote process
+        remote_vm = self.get_vm()
+        remote_vm.add_args('-machine', 'x-remote')
+        remote_vm.add_args('-nodefaults')
+        remote_vm.add_args('-device', 'lsi53c895a,id=lsi1')
+        remote_vm.add_args('-object', 'vfio-user,id=vfioobj1,'
+                           'devid=lsi1,socket='+socket)
+        remote_vm.launch()
+
+        # Create proxy process
+        self.vm.set_console()
+        self.vm.add_args('-machine', machine_type)
+        self.vm.add_args('-accel', 'kvm')
+        self.vm.add_args('-cpu', 'host')
+        self.vm.add_args('-object',
+                         'memory-backend-memfd,id=sysmem-file,size=2G')
+        self.vm.add_args('--numa', 'node,memdev=sysmem-file')
+        self.vm.add_args('-m', '2048')
+        self.vm.add_args('-kernel', kernel_path,
+                         '-initrd', initrd_path,
+                         '-append', kernel_command_line)
+        self.vm.add_args('-device',
+                         'vfio-user-pci,'
+                         'socket='+socket)
+        self.vm.launch()
+        wait_for_console_pattern(self, 'as init process',
+                                 'Kernel panic - not syncing')
+        exec_command(self, 'mount -t sysfs sysfs /sys')
+        exec_command_and_wait_for_pattern(self,
+                                          'cat /sys/bus/pci/devices/*/uevent',
+                                          'PCI_ID=1000:0012')
+
+    def test_multiprocess_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test(kernel_url, initrd_url, kernel_command_line, machine_type)
+
+    def test_multiprocess_aarch64(self):
+        """
+        :avocado: tags=arch:aarch64
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'rdinit=/bin/bash console=ttyAMA0')
+        machine_type = 'virt,gic-version=3'
+        self.do_test(kernel_url, initrd_url, kernel_command_line, machine_type)
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
@ 2021-08-27 18:05     ` Jag Raman
  2021-09-08 12:25     ` Stefan Hajnoczi
  2021-09-10 15:20     ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-08-27 18:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, Alex Williamson, Marc-André Lureau, Stefan Hajnoczi,
	thanos.makatos, alex.bennee



> On Aug 27, 2021, at 1:53 PM, Jag Raman <jag.raman@oracle.com> wrote:
> 
> add the libvfio-user library as a submodule. build it as a cmake
> subproject.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
> configure                | 11 +++++++++++
> meson.build              | 28 ++++++++++++++++++++++++++++
> .gitmodules              |  3 +++
> MAINTAINERS              |  7 +++++++
> hw/remote/meson.build    |  2 ++
> subprojects/libvfio-user |  1 +
> 6 files changed, 52 insertions(+)
> create mode 160000 subprojects/libvfio-user
> 
> diff --git a/configure b/configure
> index 9a79a00..794e900 100755
> --- a/configure
> +++ b/configure
> @@ -4291,6 +4291,17 @@ but not implemented on your system"
> fi
> 
> ##########################################
> +# check for multiprocess
> +
> +case "$multiprocess" in
> +  auto | enabled )
> +    if test "$git_submodules_action" != "ignore"; then
> +      git_submodules="${git_submodules} libvfio-user"
> +    fi
> +    ;;
> +esac
> +
> +##########################################
> # End of CC checks
> # After here, no more $cc or $ld runs
> 
> diff --git a/meson.build b/meson.build
> index bf63784..2b2d5c2 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1898,6 +1898,34 @@ if get_option('cfi') and slirp_opt == 'system'
>          + ' Please configure with --enable-slirp=git')
> endif
> 
> +vfiouser = not_found
> +if have_system and multiprocess_allowed
> +  have_internal = fs.exists(meson.current_source_dir() / 'subprojects/libvfio-user/Makefile')
> +
> +  if not have_internal
> +    error('libvfio-user source not found - please pull git submodule')
> +  endif
> +
> +  json_c = dependency('json-c', required: false)
> +    if not json_c.found()
> +      json_c = dependency('libjson-c')
> +  endif

One of the things we’re wondering is about this json-c package that we need to build
libvfio-user library.

The gitlab runners typically don’t have this package installed, as such the gitlab builds
fail. Wondering if there's a way to install this package for all QEMU builds?

We checked out the various jobs defined in “.gitlab-ci.d/buildtest.yml” - there is a
“before_script” keyword which we could use to install this package. The “before_script”
keyword appears to be run every time before a job’s script is executed. But this option
appears to be per job/build. Wondering if there's a distro-independent global way to
install a required package for all builds.

Thank you!
--
Jag

> +
> +  cmake = import('cmake')
> +
> +  vfiouser_subproj = cmake.subproject('libvfio-user')
> +
> +  vfiouser_sl = vfiouser_subproj.dependency('vfio-user-static')
> +
> +  # Although cmake links the json-c library with vfio-user-static
> +  # target, that info is not available to meson via cmake.subproject.
> +  # As such, we have to separately declare the json-c dependency here.
> +  # This appears to be a current limitation of using cmake inside meson.
> +  # libvfio-user is planning a switch to meson in the future, which
> +  # would address this item automatically.
> +  vfiouser = declare_dependency(dependencies: [vfiouser_sl, json_c])
> +endif
> +
> fdt = not_found
> fdt_opt = get_option('fdt')
> if have_system
> diff --git a/.gitmodules b/.gitmodules
> index 08b1b48..cfeea7c 100644
> --- a/.gitmodules
> +++ b/.gitmodules
> @@ -64,3 +64,6 @@
> [submodule "roms/vbootrom"]
> 	path = roms/vbootrom
> 	url = https://gitlab.com/qemu-project/vbootrom.git
> +[submodule "subprojects/libvfio-user"]
> +	path = subprojects/libvfio-user
> +	url = https://github.com/nutanix/libvfio-user.git
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 4039d3c..0c5a18e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3361,6 +3361,13 @@ F: semihosting/
> F: include/semihosting/
> F: tests/tcg/multiarch/arm-compat-semi/
> 
> +libvfio-user Library
> +M: Thanos Makatos <thanos.makatos@nutanix.com>
> +M: John Levon <john.levon@nutanix.com>
> +T: https://github.com/nutanix/libvfio-user.git
> +S: Maintained
> +F: subprojects/libvfio-user/*
> +
> Multi-process QEMU
> M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> M: Jagannathan Raman <jag.raman@oracle.com>
> diff --git a/hw/remote/meson.build b/hw/remote/meson.build
> index e6a5574..fb35fb8 100644
> --- a/hw/remote/meson.build
> +++ b/hw/remote/meson.build
> @@ -7,6 +7,8 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
> remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
> remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
> 
> +remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: vfiouser)
> +
> specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('memory.c'))
> specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy-memory-listener.c'))
> 
> diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
> new file mode 160000
> index 0000000..647c934
> --- /dev/null
> +++ b/subprojects/libvfio-user
> @@ -0,0 +1 @@
> +Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
> -- 
> 1.8.3.1
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-08-24 14:15   ` Stefan Hajnoczi
@ 2021-08-30  3:00     ` John Johnson
  2021-09-07 13:21       ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-08-30  3:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Aug 24, 2021, at 7:15 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:37AM -0700, Elena Ufimtseva wrote:
>> @@ -3361,13 +3362,35 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>>     VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
>>     VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>>     VFIODevice *vbasedev = &vdev->vbasedev;
>> +    SocketAddress addr;
>> +    VFIOProxy *proxy;
>> +    Error *err = NULL;
>> 
>> +    /*
>> +     * TODO: make option parser understand SocketAddress
>> +     * and use that instead of having scaler options
> 
> s/scaler/scalar/
> 

	OK


>> +VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
>> +{
>> +    VFIOProxy *proxy;
>> +    QIOChannelSocket *sioc;
>> +    QIOChannel *ioc;
>> +    char *sockname;
>> +
>> +    if (addr->type != SOCKET_ADDRESS_TYPE_UNIX) {
>> +        error_setg(errp, "vfio_user_connect - bad address family");
>> +        return NULL;
>> +    }
>> +    sockname = addr->u.q_unix.path;
>> +
>> +    sioc = qio_channel_socket_new();
>> +    ioc = QIO_CHANNEL(sioc);
>> +    if (qio_channel_socket_connect_sync(sioc, addr, errp)) {
>> +        object_unref(OBJECT(ioc));
>> +        return NULL;
>> +    }
>> +    qio_channel_set_blocking(ioc, true, NULL);
>> +
>> +    proxy = g_malloc0(sizeof(VFIOProxy));
>> +    proxy->sockname = sockname;
> 
> sockname is addr->u.q_unix.path, so there's an assumption that the
> lifetime of the addr argument is at least as long as the proxy object's
> lifetime. This doesn't seem to be the case in vfio_user_pci_realize()
> since the SocketAddress variable is declared on the stack.
> 
> I suggest making SocketAddress *addr const so it's obvious that this
> function just reads it (doesn't take ownership of the pointer) and
> copying the UNIX domain socket path with g_strdup() to avoid the
> dangling pointer.
> 

	OK


>> +    proxy->ioc = ioc;
>> +    proxy->flags = VFIO_PROXY_CLIENT;
>> +    proxy->state = VFIO_PROXY_CONNECTED;
>> +    qemu_cond_init(&proxy->close_cv);
>> +
>> +    if (vfio_user_iothread == NULL) {
>> +        vfio_user_iothread = iothread_create("VFIO user", errp);
>> +    }
> 
> Why is a dedicated IOThread needed for VFIO user?
> 

	It seemed the best model for inbound message processing.  Most messages
are replies, so the receiver will either signal a thread waiting the reply or
report any errors from the server if there is no waiter.  None of this requires
the BQL.

	If the message is a request - which currently only happens if device
DMA targets guest memory that wasn’t mmap()d by QEMU or if the ’secure-dma’
option is used - then the receiver can then acquire BQL so it can call the
VFIO code to handle the request.


>> +
>> +    qemu_mutex_init(&proxy->lock);
>> +    QTAILQ_INIT(&proxy->free);
>> +    QTAILQ_INIT(&proxy->pending);
>> +    QLIST_INSERT_HEAD(&vfio_user_sockets, proxy, next);
>> +
>> +    return proxy;
>> +}
>> +
> 
> /* Called with the BQL */

	OK

>> +void vfio_user_disconnect(VFIOProxy *proxy)
>> +{
>> +    VFIOUserReply *r1, *r2;
>> +
>> +    qemu_mutex_lock(&proxy->lock);
>> +
>> +    /* our side is quitting */
>> +    if (proxy->state == VFIO_PROXY_CONNECTED) {
>> +        vfio_user_shutdown(proxy);
>> +        if (!QTAILQ_EMPTY(&proxy->pending)) {
>> +            error_printf("vfio_user_disconnect: outstanding requests\n");
>> +        }
>> +    }
>> +    object_unref(OBJECT(proxy->ioc));
>> +    proxy->ioc = NULL;
>> +
>> +    proxy->state = VFIO_PROXY_CLOSING;
>> +    QTAILQ_FOREACH_SAFE(r1, &proxy->pending, next, r2) {
>> +        qemu_cond_destroy(&r1->cv);
>> +        QTAILQ_REMOVE(&proxy->pending, r1, next);
>> +        g_free(r1);
>> +    }
>> +    QTAILQ_FOREACH_SAFE(r1, &proxy->free, next, r2) {
>> +        qemu_cond_destroy(&r1->cv);
>> +        QTAILQ_REMOVE(&proxy->free, r1, next);
>> +        g_free(r1);
>> +    }
>> +
>> +    /*
>> +     * Make sure the iothread isn't blocking anywhere
>> +     * with a ref to this proxy by waiting for a BH
>> +     * handler to run after the proxy fd handlers were
>> +     * deleted above.
>> +     */
>> +    proxy->close_wait = 1;
> 
> Please use true. '1' is shorter but it's less obvious to the reader (I
> had to jump to the definition to check whether this field was bool or
> int).
> 

	I assume this is also true for the other boolean struct members
I’ve added.


>> +    aio_bh_schedule_oneshot(iothread_get_aio_context(vfio_user_iothread),
>> +                            vfio_user_cb, proxy);
>> +
>> +    /* drop locks so the iothread can make progress */
>> +    qemu_mutex_unlock_iothread();
> 
> Why does the BQL needs to be dropped so vfio_user_iothread can make
> progress?
> 

	See above.  Acquiring BQL by the iothread is rare, but I have to
handle the case where a disconnect is concurrent with an incoming request
message that is waiting for the BQL.  See the proxy state check after BQL
is acquired in vfio_user_recv()


>> +    qemu_cond_wait(&proxy->close_cv, &proxy->lock);


								JJ




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions
  2021-08-24 15:14   ` Stefan Hajnoczi
@ 2021-08-30  3:04     ` John Johnson
  2021-09-07 13:35       ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-08-30  3:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, john.levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Aug 24, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:38AM -0700, Elena Ufimtseva wrote:
>> @@ -62,5 +65,10 @@ typedef struct VFIOProxy {
>> 
>> VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
>> void vfio_user_disconnect(VFIOProxy *proxy);
>> +void vfio_user_set_reqhandler(VFIODevice *vbasdev,
> 
> "vbasedev" for consistency?
> 

	OK

>> +                              int (*handler)(void *opaque, char *buf,
>> +                                             VFIOUserFDs *fds),
>> +                                             void *reqarg);
> 
> The handler callback is undocumented. What context does it run in, what
> do the arguments mean, and what should the function return? Please
> document it so it's easy for others to modify this code in the future
> without reverse-engineering the assumptions behind it.
> 

	OK


>> +void vfio_user_recv(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOProxy *proxy = vbasedev->proxy;
>> +    VFIOUserReply *reply = NULL;
>> +    g_autofree int *fdp = NULL;
>> +    VFIOUserFDs reqfds = { 0, 0, fdp };
>> +    VFIOUserHdr msg;
>> +    struct iovec iov = {
>> +        .iov_base = &msg,
>> +        .iov_len = sizeof(msg),
>> +    };
>> +    bool isreply;
>> +    int i, ret;
>> +    size_t msgleft, numfds = 0;
>> +    char *data = NULL;
>> +    g_autofree char *buf = NULL;
>> +    Error *local_err = NULL;
>> +
>> +    qemu_mutex_lock(&proxy->lock);
>> +    if (proxy->state == VFIO_PROXY_CLOSING) {
>> +        qemu_mutex_unlock(&proxy->lock);
>> +        return;
>> +    }
>> +
>> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
>> +                                 &local_err);
> 
> This is a blocking call. My understanding is that the IOThread is shared
> by all vfio-user devices, so other devices will have to wait if one of
> them is acting up (e.g. the device emulation process sent less than
> sizeof(msg) bytes).
> 
> While we're blocked in this function the proxy device cannot be
> hot-removed since proxy->lock is held.
> 
> It would more robust to use of the event loop to avoid blocking. There
> could be a per-connection receiver coroutine that calls
> qio_channel_readv_full_all_eof() (it yields the coroutine if reading
> would block).
> 

	I thought the main loop uses BQL, which I don’t need for most
message processing.  The blocking behavior can be fixed with FIONREAD
beforehand to detect a message with fewer bytes than in a header.



>> +    /*
>> +     * Replies signal a waiter, requests get processed by vfio code
>> +     * that may assume the iothread lock is held.
>> +     */
>> +    if (isreply) {
>> +        reply->complete = 1;
>> +        if (!reply->nowait) {
>> +            qemu_cond_signal(&reply->cv);
>> +        } else {
>> +            if (msg.flags & VFIO_USER_ERROR) {
>> +                error_printf("vfio_user_rcv error reply on async request ");
>> +                error_printf("command %x error %s\n", msg.command,
>> +                             strerror(msg.error_reply));
>> +            }
>> +            /* just free it if no one is waiting */
>> +            reply->nowait = 0;
>> +            if (proxy->last_nowait == reply) {
>> +                proxy->last_nowait = NULL;
>> +            }
>> +            g_free(reply->msg);
>> +            QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
>> +        }
>> +        qemu_mutex_unlock(&proxy->lock);
>> +    } else {
>> +        qemu_mutex_unlock(&proxy->lock);
>> +        qemu_mutex_lock_iothread();
> 
> The fact that proxy->request() runs with the BQL suggests that VFIO
> communication should take place in the main event loop thread instead of
> a separate IOThread.
> 

	See the last reply.  Using the main event loop optimizes the
least common case.


>> +        /*
>> +         * make sure proxy wasn't closed while we waited
>> +         * checking state without holding the proxy lock is safe
>> +         * since it's only set to CLOSING when BQL is held
>> +         */
>> +        if (proxy->state != VFIO_PROXY_CLOSING) {
>> +            ret = proxy->request(proxy->reqarg, buf, &reqfds);
> 
> The request() callback in an earlier patch is a noop for the client
> implementation. Who frees passed fds?
> 

	Right now no server->client requests send fd’s, but I do need
a single point where they are consumed if an error is returned. 


>> +            if (ret < 0 && !(msg.flags & VFIO_USER_NO_REPLY)) {
>> +                vfio_user_send_reply(proxy, buf, ret);
>> +            }
>> +        }
>> +        qemu_mutex_unlock_iothread();
>> +    }
>> +    return;
>> +
>> +fatal:
>> +    vfio_user_shutdown(proxy);
>> +    proxy->state = VFIO_PROXY_RECV_ERROR;
>> +
>> +err:
>> +    for (i = 0; i < numfds; i++) {
>> +        close(fdp[i]);
>> +    }
>> +    if (reply != NULL) {
>> +        /* force an error to keep sending thread from hanging */
>> +        reply->msg->flags |= VFIO_USER_ERROR;
>> +        reply->msg->error_reply = EINVAL;
>> +        reply->complete = 1;
>> +        qemu_cond_signal(&reply->cv);
> 
> What about fd passing? The actual fds have been closed already in fdp[]
> but reply has a copy too.
> 

	If the sender gets an error, it won’t be using the fd’s.  I
can zero reply->fds to make this clearer.


> What about the nowait case? If no one is waiting on reply->cv so this
> reply will be leaked?
> 

	This looks like a leak.

			JJ


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server
  2021-08-24 15:59   ` Stefan Hajnoczi
@ 2021-08-30  3:08     ` John Johnson
  2021-09-07 13:52       ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-08-30  3:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Aug 24, 2021, at 8:59 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:39AM -0700, Elena Ufimtseva wrote:
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 7005d9f891..eae33e746f 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3397,6 +3397,12 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>>         proxy->flags |= VFIO_PROXY_SECURE;
>>     }
>> 
>> +    vfio_user_validate_version(vbasedev, &err);
>> +    if (err != NULL) {
>> +        error_propagate(errp, err);
>> +        goto error;
>> +    }
>> +
>>     vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
>>     vbasedev->dev = DEVICE(vdev);
>>     vbasedev->fd = -1;
>> @@ -3404,6 +3410,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>>     vbasedev->no_mmap = false;
>>     vbasedev->ops = &vfio_user_pci_ops;
>> 
>> +error:
> 
> Missing return before error label? We shouldn't disconnect in the
> success case.
> 

	The return ended up in a later patch.



>> +    vfio_user_disconnect(proxy);
>> +    error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> }
>> 
>> static void vfio_user_instance_finalize(Object *obj)
>> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
>> index 2fcc77d997..e89464a571 100644
>> --- a/hw/vfio/user.c
>> +++ b/hw/vfio/user.c
>> @@ -23,9 +23,16 @@
>> #include "io/channel-socket.h"
>> #include "io/channel-util.h"
>> #include "sysemu/iothread.h"
>> +#include "qapi/qmp/qdict.h"
>> +#include "qapi/qmp/qjson.h"
>> +#include "qapi/qmp/qnull.h"
>> +#include "qapi/qmp/qstring.h"
>> +#include "qapi/qmp/qnum.h"
>> #include "user.h"
>> 
>> static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
>> +static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
>> +static int wait_time = 1000;   /* wait 1 sec for replies */
>> static IOThread *vfio_user_iothread;
>> 
>> static void vfio_user_shutdown(VFIOProxy *proxy);
>> @@ -34,7 +41,14 @@ static void vfio_user_send_locked(VFIOProxy *proxy, VFIOUserHdr *msg,
>>                                   VFIOUserFDs *fds);
>> static void vfio_user_send(VFIOProxy *proxy, VFIOUserHdr *msg,
>>                            VFIOUserFDs *fds);
>> +static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
>> +                                  uint32_t size, uint32_t flags);
>> +static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
>> +                                VFIOUserFDs *fds, int rsize, int flags);
>> 
>> +/* vfio_user_send_recv flags */
>> +#define NOWAIT          0x1  /* do not wait for reply */
>> +#define NOIOLOCK        0x2  /* do not drop iolock */
> 
> Please use "BQL", it's a widely used term while "iolock" isn't used:
> s/IOLOCK/BQL/
> 

	OK

>> 
>> /*
>>  * Functions called by main, CPU, or iothread threads
>> @@ -333,6 +347,79 @@ static void vfio_user_cb(void *opaque)
>>  * Functions called by main or CPU threads
>>  */
>> 
>> +static void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
>> +                                VFIOUserFDs *fds, int rsize, int flags)
>> +{
>> +    VFIOUserReply *reply;
>> +    bool iolock = 0;
>> +
>> +    if (msg->flags & VFIO_USER_NO_REPLY) {
>> +        error_printf("vfio_user_send_recv on async message\n");
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * We may block later, so use a per-proxy lock and let
>> +     * the iothreads run while we sleep unless told no to
> 
> s/no/not/

	OK


> 
>> +int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    g_autofree VFIOUserVersion *msgp;
>> +    GString *caps;
>> +    int size, caplen;
>> +
>> +    caps = caps_json();
>> +    caplen = caps->len + 1;
>> +    size = sizeof(*msgp) + caplen;
>> +    msgp = g_malloc0(size);
>> +
>> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_VERSION, size, 0);
>> +    msgp->major = VFIO_USER_MAJOR_VER;
>> +    msgp->minor = VFIO_USER_MINOR_VER;
>> +    memcpy(&msgp->capabilities, caps->str, caplen);
>> +    g_string_free(caps, true);
>> +
>> +    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0, 0);
>> +    if (msgp->hdr.flags & VFIO_USER_ERROR) {
>> +        error_setg_errno(errp, msgp->hdr.error_reply, "version reply");
>> +        return -1;
>> +    }
>> +
>> +    if (msgp->major != VFIO_USER_MAJOR_VER ||
>> +        msgp->minor > VFIO_USER_MINOR_VER) {
>> +        error_setg(errp, "incompatible server version");
>> +        return -1;
>> +    }
>> +    if (caps_check(msgp->minor, (char *)msgp + sizeof(*msgp), errp) != 0) {
> 
> The reply is untrusted so we cannot treat it as a NUL-terminated string
> yet. The final byte msgp->capabilities[] needs to be checked first.
> 
> Please be careful about input validation, I might miss something so it's
> best if you audit the patches too. QEMU must not trust the device
> emulation process and vice versa.
> 

	This message is consumed by vfio-user, so I can check for valid
replies, but for most messages this checking will have to be done at in
the VFIO common or bus-specific code, as vfio-user doens’t know valid
data from invalid.

								JJ


>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> -- 
>> 2.25.1


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 07/16] vfio-user: get device info
  2021-08-24 16:04   ` Stefan Hajnoczi
@ 2021-08-30  3:11     ` John Johnson
  2021-09-07 13:54       ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-08-30  3:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, john.levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Aug 24, 2021, at 9:04 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:40AM -0700, Elena Ufimtseva wrote:
>> +int vfio_user_get_info(VFIODevice *vbasedev)
>> +{
>> +    VFIOUserDeviceInfo msg;
>> +
>> +    memset(&msg, 0, sizeof(msg));
>> +    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_INFO, sizeof(msg), 0);
>> +    msg.argsz = sizeof(struct vfio_device_info);
>> +
>> +    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
>> +    if (msg.hdr.flags & VFIO_USER_ERROR) {
>> +        return -msg.hdr.error_reply;
>> +    }
>> +
>> +    vbasedev->num_irqs = msg.num_irqs;
>> +    vbasedev->num_regions = msg.num_regions;
>> +    vbasedev->flags = msg.flags;
>> +    vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
> 
> No input validation. I haven't checked what happens when num_irqs,
> num_regions, or flags are bogus but it's a little concerning. Unlike
> kernel VFIO, we do not trust these values.
> 

	As in the last reply, vfio-user doesn’t know valid values
from invalid, so I need to re-work this so the PCI-specific code that
calls vfio-user_get_info() can test for invalid values.

							JJ



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-08-30  3:00     ` John Johnson
@ 2021-09-07 13:21       ` Stefan Hajnoczi
  2021-09-09  5:11         ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 13:21 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 3934 bytes --]

On Mon, Aug 30, 2021 at 03:00:37AM +0000, John Johnson wrote:
> > On Aug 24, 2021, at 7:15 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Aug 16, 2021 at 09:42:37AM -0700, Elena Ufimtseva wrote:
> >> +    proxy->ioc = ioc;
> >> +    proxy->flags = VFIO_PROXY_CLIENT;
> >> +    proxy->state = VFIO_PROXY_CONNECTED;
> >> +    qemu_cond_init(&proxy->close_cv);
> >> +
> >> +    if (vfio_user_iothread == NULL) {
> >> +        vfio_user_iothread = iothread_create("VFIO user", errp);
> >> +    }
> > 
> > Why is a dedicated IOThread needed for VFIO user?
> > 
> 
> 	It seemed the best model for inbound message processing.  Most messages
> are replies, so the receiver will either signal a thread waiting the reply or
> report any errors from the server if there is no waiter.  None of this requires
> the BQL.
> 
> 	If the message is a request - which currently only happens if device
> DMA targets guest memory that wasn’t mmap()d by QEMU or if the ’secure-dma’
> option is used - then the receiver can then acquire BQL so it can call the
> VFIO code to handle the request.

QEMU is generally event-driven and the APIs are designed for that style.
Threads in QEMU are there for parallelism or resource control,
everything else is event-driven.

It's not clear to me if there is a reason why the message processing
must be done in a separate thread or whether it is just done this way
because the code was written in multi-threaded style instead of
event-driven style.

You mentioned other threads waiting for replies. Which threads are they?

> > Please use true. '1' is shorter but it's less obvious to the reader (I
> > had to jump to the definition to check whether this field was bool or
> > int).
> > 
> 
> 	I assume this is also true for the other boolean struct members
> I’ve added.

Yes, please. QEMU uses bool and true/false.

> 
> 
> >> +    aio_bh_schedule_oneshot(iothread_get_aio_context(vfio_user_iothread),
> >> +                            vfio_user_cb, proxy);
> >> +
> >> +    /* drop locks so the iothread can make progress */
> >> +    qemu_mutex_unlock_iothread();
> > 
> > Why does the BQL needs to be dropped so vfio_user_iothread can make
> > progress?
> > 
> 
> 	See above.  Acquiring BQL by the iothread is rare, but I have to
> handle the case where a disconnect is concurrent with an incoming request
> message that is waiting for the BQL.  See the proxy state check after BQL
> is acquired in vfio_user_recv()

Here is how this code looks when written using coroutines (this is from
nbd/server.c):

  static coroutine_fn void nbd_trip(void *opaque)
  {
      ...
      req = nbd_request_get(client);
      ret = nbd_co_receive_request(req, &request, &local_err);
      client->recv_coroutine = NULL;
  
      if (client->closing) {
          /*
           * The client may be closed when we are blocked in
           * nbd_co_receive_request()
           */
          goto done;
      }

It's the same check. The code is inverted: the server reads the next
request using nbd_co_receive_request() (this coroutine function can
yield while waiting for data on the socket).

This way the network communication code doesn't need to know how
messages will by processed by the client or server. There is no need for
if (isreply) { qemu_cond_signal(&reply->cv); } else {
proxy->request(proxy->reqarg, buf, &reqfds); }. The callbacks and
threads aren't hardcoded into the network communication code.

This goes back to the question earlier about why a dedicated thread is
necessary here. I suggest writing the network communication code using
coroutines. That way the code is easier to read (no callbacks or
thread synchronization), there are fewer thread-safety issues to worry
about, and users or management tools don't need to know about additional
threads (e.g. CPU/NUMA affinity).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions
  2021-08-30  3:04     ` John Johnson
@ 2021-09-07 13:35       ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 13:35 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, john.levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 3744 bytes --]

On Mon, Aug 30, 2021 at 03:04:08AM +0000, John Johnson wrote:
> 
> 
> > On Aug 24, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Mon, Aug 16, 2021 at 09:42:38AM -0700, Elena Ufimtseva wrote:
> >> @@ -62,5 +65,10 @@ typedef struct VFIOProxy {
> >> 
> >> VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
> >> void vfio_user_disconnect(VFIOProxy *proxy);
> >> +void vfio_user_set_reqhandler(VFIODevice *vbasdev,
> > 
> > "vbasedev" for consistency?
> > 
> 
> 	OK
> 
> >> +                              int (*handler)(void *opaque, char *buf,
> >> +                                             VFIOUserFDs *fds),
> >> +                                             void *reqarg);
> > 
> > The handler callback is undocumented. What context does it run in, what
> > do the arguments mean, and what should the function return? Please
> > document it so it's easy for others to modify this code in the future
> > without reverse-engineering the assumptions behind it.
> > 
> 
> 	OK
> 
> 
> >> +void vfio_user_recv(void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOProxy *proxy = vbasedev->proxy;
> >> +    VFIOUserReply *reply = NULL;
> >> +    g_autofree int *fdp = NULL;
> >> +    VFIOUserFDs reqfds = { 0, 0, fdp };
> >> +    VFIOUserHdr msg;
> >> +    struct iovec iov = {
> >> +        .iov_base = &msg,
> >> +        .iov_len = sizeof(msg),
> >> +    };
> >> +    bool isreply;
> >> +    int i, ret;
> >> +    size_t msgleft, numfds = 0;
> >> +    char *data = NULL;
> >> +    g_autofree char *buf = NULL;
> >> +    Error *local_err = NULL;
> >> +
> >> +    qemu_mutex_lock(&proxy->lock);
> >> +    if (proxy->state == VFIO_PROXY_CLOSING) {
> >> +        qemu_mutex_unlock(&proxy->lock);
> >> +        return;
> >> +    }
> >> +
> >> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
> >> +                                 &local_err);
> > 
> > This is a blocking call. My understanding is that the IOThread is shared
> > by all vfio-user devices, so other devices will have to wait if one of
> > them is acting up (e.g. the device emulation process sent less than
> > sizeof(msg) bytes).
> > 
> > While we're blocked in this function the proxy device cannot be
> > hot-removed since proxy->lock is held.
> > 
> > It would more robust to use of the event loop to avoid blocking. There
> > could be a per-connection receiver coroutine that calls
> > qio_channel_readv_full_all_eof() (it yields the coroutine if reading
> > would block).
> > 
> 
> 	I thought the main loop uses BQL, which I don’t need for most
> message processing.  The blocking behavior can be fixed with FIONREAD
> beforehand to detect a message with fewer bytes than in a header.

It's I/O-bound work, exactly what the main loop was intended for.

I'm not sure the BQL can be avoided anyway:
- The vfio-user client runs under the BQL (a vCPU thread).
- The vfio-user server needs to hold the BQL since most QEMU device
  models assume they are running under the BQL.

The network communication code doesn't need to know about the BQL
though. Event-driven code (code that runs in an AioContext) can rely on
the fact that its callbacks only execute in the AioContext, i.e. in one
thread at any given time.

The code probably doesn't need explicit BQL lock/unlock and can run
safely in another IOThread if the user wishes (I would leave that up to
the user, e.g. -device vfio-user-pci,iothread=iothread0, instead of
creating a dedicated IOThread that is shared for all vfio-user
communication). See nbd/server.c for an example of doing event-driven
network I/O with coroutines.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server
  2021-08-30  3:08     ` John Johnson
@ 2021-09-07 13:52       ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 13:52 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 2194 bytes --]

On Mon, Aug 30, 2021 at 03:08:50AM +0000, John Johnson wrote:
> > On Aug 24, 2021, at 8:59 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Aug 16, 2021 at 09:42:39AM -0700, Elena Ufimtseva wrote:
> >> +int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
> >> +{
> >> +    g_autofree VFIOUserVersion *msgp;
> >> +    GString *caps;
> >> +    int size, caplen;
> >> +
> >> +    caps = caps_json();
> >> +    caplen = caps->len + 1;
> >> +    size = sizeof(*msgp) + caplen;
> >> +    msgp = g_malloc0(size);
> >> +
> >> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_VERSION, size, 0);
> >> +    msgp->major = VFIO_USER_MAJOR_VER;
> >> +    msgp->minor = VFIO_USER_MINOR_VER;
> >> +    memcpy(&msgp->capabilities, caps->str, caplen);
> >> +    g_string_free(caps, true);
> >> +
> >> +    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, 0, 0);
> >> +    if (msgp->hdr.flags & VFIO_USER_ERROR) {
> >> +        error_setg_errno(errp, msgp->hdr.error_reply, "version reply");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (msgp->major != VFIO_USER_MAJOR_VER ||
> >> +        msgp->minor > VFIO_USER_MINOR_VER) {
> >> +        error_setg(errp, "incompatible server version");
> >> +        return -1;
> >> +    }
> >> +    if (caps_check(msgp->minor, (char *)msgp + sizeof(*msgp), errp) != 0) {
> > 
> > The reply is untrusted so we cannot treat it as a NUL-terminated string
> > yet. The final byte msgp->capabilities[] needs to be checked first.
> > 
> > Please be careful about input validation, I might miss something so it's
> > best if you audit the patches too. QEMU must not trust the device
> > emulation process and vice versa.
> > 
> 
> 	This message is consumed by vfio-user, so I can check for valid
> replies, but for most messages this checking will have to be done at in
> the VFIO common or bus-specific code, as vfio-user doens’t know valid
> data from invalid.

Good point. Today's VFIO code trusts the replies because they come from
the kernel. Now that they can come from an untrusted source they must be
validated and common VFIO code may need to change its trust model.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 07/16] vfio-user: get device info
  2021-08-30  3:11     ` John Johnson
@ 2021-09-07 13:54       ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 13:54 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, swapnil.ingle, john.levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 1558 bytes --]

On Mon, Aug 30, 2021 at 03:11:39AM +0000, John Johnson wrote:
> 
> 
> > On Aug 24, 2021, at 9:04 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Mon, Aug 16, 2021 at 09:42:40AM -0700, Elena Ufimtseva wrote:
> >> +int vfio_user_get_info(VFIODevice *vbasedev)
> >> +{
> >> +    VFIOUserDeviceInfo msg;
> >> +
> >> +    memset(&msg, 0, sizeof(msg));
> >> +    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_INFO, sizeof(msg), 0);
> >> +    msg.argsz = sizeof(struct vfio_device_info);
> >> +
> >> +    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
> >> +    if (msg.hdr.flags & VFIO_USER_ERROR) {
> >> +        return -msg.hdr.error_reply;
> >> +    }
> >> +
> >> +    vbasedev->num_irqs = msg.num_irqs;
> >> +    vbasedev->num_regions = msg.num_regions;
> >> +    vbasedev->flags = msg.flags;
> >> +    vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
> > 
> > No input validation. I haven't checked what happens when num_irqs,
> > num_regions, or flags are bogus but it's a little concerning. Unlike
> > kernel VFIO, we do not trust these values.
> > 
> 
> 	As in the last reply, vfio-user doesn’t know valid values
> from invalid, so I need to re-work this so the PCI-specific code that
> calls vfio-user_get_info() can test for invalid values.

Sounds good. I won't look further for missing input validation in the
VFIO message contents in this revision of the patch series. Once you're
happy with input validation I'll look at the code from this angle again.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 08/16] vfio-user: get region info
  2021-08-16 16:42 ` [PATCH RFC v2 08/16] vfio-user: get region info Elena Ufimtseva
@ 2021-09-07 14:31   ` Stefan Hajnoczi
  2021-09-09  5:35     ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 14:31 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 3879 bytes --]

On Mon, Aug 16, 2021 at 09:42:41AM -0700, Elena Ufimtseva wrote:
> @@ -1514,6 +1515,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>      return true;
>  }
>  
> +static int vfio_get_region_info_remfd(VFIODevice *vbasedev, int index)
> +{
> +    struct vfio_region_info *info;
> +
> +    if (vbasedev->regions == NULL || vbasedev->regions[index] == NULL) {
> +        vfio_get_region_info(vbasedev, index, &info);
> +    }

Maybe this will be called from other places in the future, but the
vfio_region_setup() caller below already invoked vfio_get_region_info()
so I'm not sure it's necessary to do this again?

Perhaps add an int *remfd argument to vfio_get_region_info(). That way
vfio_get_region_info_remfd() isn't necessary.

> @@ -2410,6 +2442,24 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info)
>  {
>      size_t argsz = sizeof(struct vfio_region_info);
> +    int fd = -1;
> +    int ret;
> +
> +    /* create region cache */
> +    if (vbasedev->regions == NULL) {
> +        vbasedev->regions = g_new0(struct vfio_region_info *,
> +                                   vbasedev->num_regions);
> +        if (vbasedev->proxy != NULL) {
> +            vbasedev->regfds = g_new0(int, vbasedev->num_regions);
> +        }
> +    }
> +    /* check cache */
> +    if (vbasedev->regions[index] != NULL) {
> +        *info = g_malloc0(vbasedev->regions[index]->argsz);
> +        memcpy(*info, vbasedev->regions[index],
> +               vbasedev->regions[index]->argsz);
> +        return 0;
> +    }

Why is it necessary to introduce a cache? Is it to avoid passing the
same fd multiple times?

>  
>      *info = g_malloc0(argsz);
>  
> @@ -2417,7 +2467,17 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>  retry:
>      (*info)->argsz = argsz;
>  
> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
> +    if (vbasedev->proxy != NULL) {
> +        VFIOUserFDs fds = { 0, 1, &fd};
> +
> +        ret = vfio_user_get_region_info(vbasedev, index, *info, &fds);
> +    } else {
> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info);
> +        if (ret < 0) {
> +            ret = -errno;
> +        }
> +    }
> +    if (ret != 0) {
>          g_free(*info);
>          *info = NULL;
>          return -errno;
> @@ -2426,10 +2486,22 @@ retry:
>      if ((*info)->argsz > argsz) {
>          argsz = (*info)->argsz;
>          *info = g_realloc(*info, argsz);
> +        if (fd != -1) {
> +            close(fd);
> +            fd = -1;
> +        }
>  
>          goto retry;
>      }
>  
> +    /* fill cache */
> +    vbasedev->regions[index] = g_malloc0(argsz);
> +    memcpy(vbasedev->regions[index], *info, argsz);
> +    *vbasedev->regions[index] = **info;

The previous line already copied the contents of *info. What is the
purpose of this assignment?

> +    if (vbasedev->regfds != NULL) {
> +        vbasedev->regfds[index] = fd;
> +    }
> +
>      return 0;
>  }
>  
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> index b584b8e0f2..91b51f37df 100644
> --- a/hw/vfio/user.c
> +++ b/hw/vfio/user.c
> @@ -734,3 +734,36 @@ int vfio_user_get_info(VFIODevice *vbasedev)
>      vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
>      return 0;
>  }
> +
> +int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
> +                              struct vfio_region_info *info, VFIOUserFDs *fds)
> +{
> +    g_autofree VFIOUserRegionInfo *msgp = NULL;
> +    int size;

Please use uint32_t size instead of int size to prevent possible
signedness issues:
- VFIOUserRegionInfo->argsz is uint32_t.
- sizeof(VFIOUserHdr) is size_t.
- The vfio_user_request_msg() size argument is uint32_t.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-08-16 16:42 ` [PATCH RFC v2 09/16] vfio-user: region read/write Elena Ufimtseva
@ 2021-09-07 14:41   ` Stefan Hajnoczi
  2021-09-07 17:24   ` John Levon
  1 sibling, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 14:41 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 3860 bytes --]

On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7d667b0533..a8b1ea9358 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -215,6 +215,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
>          uint32_t dword;
>          uint64_t qword;
>      } buf;
> +    int ret;
>  
>      switch (size) {
>      case 1:
> @@ -234,7 +235,12 @@ void vfio_region_write(void *opaque, hwaddr addr,
>          break;
>      }
>  
> -    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
> +    if (vbasedev->proxy != NULL) {
> +        ret = vfio_user_region_write(vbasedev, region->nr, addr, size, &data);
> +    } else {
> +        ret = pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr);
> +    }

The vfio-user spec says everything is little-endian. Why does
vfio_user_region_write() take the host-endian uint64_t data value
instead of the little-endian buf value?

> +    if (ret != size) {
>          error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
>                       ",%d) failed: %m",
>                       __func__, vbasedev->name, region->nr,
> @@ -266,8 +272,14 @@ uint64_t vfio_region_read(void *opaque,
>          uint64_t qword;
>      } buf;
>      uint64_t data = 0;
> +    int ret;
>  
> -    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
> +    if (vbasedev->proxy != NULL) {
> +        ret = vfio_user_region_read(vbasedev, region->nr, addr, size, &buf);
> +    } else {
> +        ret = pread(vbasedev->fd, &buf, size, region->fd_offset + addr);
> +    }
> +    if (ret != size) {
>          error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
>                       __func__, vbasedev->name, region->nr,
>                       addr, size);
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> index 91b51f37df..83235b2411 100644
> --- a/hw/vfio/user.c
> +++ b/hw/vfio/user.c
> @@ -767,3 +767,46 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
>      memcpy(info, &msgp->argsz, info->argsz);
>      return 0;
>  }
> +
> +int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
> +                                 uint32_t count, void *data)
> +{
> +    g_autofree VFIOUserRegionRW *msgp = NULL;
> +    int size = sizeof(*msgp) + count;
> +
> +    msgp = g_malloc0(size);
> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_READ, sizeof(*msgp), 0);
> +    msgp->offset = offset;
> +    msgp->region = index;
> +    msgp->count = count;
> +
> +    vfio_user_send_recv(vbasedev->proxy, &msgp->hdr, NULL, size, 0);
> +    if (msgp->hdr.flags & VFIO_USER_ERROR) {
> +        return -msgp->hdr.error_reply;
> +    } else if (msgp->count > count) {
> +        return -E2BIG;
> +    } else {
> +        memcpy(data, &msgp->data, msgp->count);
> +    }
> +
> +    return msgp->count;
> +}
> +
> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> +                           uint64_t offset, uint32_t count, void *data)
> +{
> +    g_autofree VFIOUserRegionRW *msgp = NULL;
> +    int size = sizeof(*msgp) + count;
> +
> +    msgp = g_malloc0(size);
> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
> +                          VFIO_USER_NO_REPLY);
> +    msgp->offset = offset;
> +    msgp->region = index;
> +    msgp->count = count;
> +    memcpy(&msgp->data, data, count);
> +
> +    vfio_user_send(vbasedev->proxy, &msgp->hdr, NULL);

Are VFIO region writes posted writes (VFIO_USER_NO_REPLY)? This can be a
problem if the device driver performs a write to the region followed by
another access (e.g. to an mmap region) and expected the write to
complete before the second access takes place.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 10/16] vfio-user: pci_user_realize PCI setup
  2021-08-16 16:42 ` [PATCH RFC v2 10/16] vfio-user: pci_user_realize PCI setup Elena Ufimtseva
@ 2021-09-07 15:00   ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 15:00 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 2781 bytes --]

On Mon, Aug 16, 2021 at 09:42:43AM -0700, Elena Ufimtseva wrote:
> @@ -3423,6 +3478,91 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>          goto error;
>      }
>  
> +    /* Get a copy of config space */
> +    ret = vfio_user_region_read(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX, 0,
> +                                MIN(pci_config_size(pdev), vdev->config_size),
> +                                pdev->config);
> +    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
> +        error_setg_errno(errp, -ret, "failed to read device config space");
> +        goto error;
> +    }
> +
> +    /* vfio emulates a lot for us, but some bits need extra love */
> +    vdev->emulated_config_bits = g_malloc0(vdev->config_size);
> +
> +    /* QEMU can choose to expose the ROM or not */
> +    memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
> +    /* QEMU can also add or extend BARs */
> +    memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
> +    vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
> +    vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
> +
> +    /* QEMU can change multi-function devices to single function, or reverse */
> +    vdev->emulated_config_bits[PCI_HEADER_TYPE] =
> +                                              PCI_HEADER_TYPE_MULTI_FUNCTION;
> +
> +    /* Restore or clear multifunction, this is always controlled by QEMU */
> +    if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        vdev->pdev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
> +    } else {
> +        vdev->pdev.config[PCI_HEADER_TYPE] &= ~PCI_HEADER_TYPE_MULTI_FUNCTION;
> +    }
> +
> +    /*
> +     * Clear host resource mapping info.  If we choose not to register a
> +     * BAR, such as might be the case with the option ROM, we can get
> +     * confusing, unwritable, residual addresses from the host here.
> +     */
> +    memset(&vdev->pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
> +    memset(&vdev->pdev.config[PCI_ROM_ADDRESS], 0, 4);
> +
> +    vfio_pci_size_rom(vdev);
> +
> +    vfio_bars_prepare(vdev);
> +
> +    vfio_msix_early_setup(vdev, &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        goto error;
> +    }
> +
> +    vfio_bars_register(vdev);
> +
> +    ret = vfio_add_capabilities(vdev, errp);
> +    if (ret) {
> +        goto out_teardown;
> +    }

I haven't audited the common code to find places where the contents of
the PCI Configuration Space are trusted. Input validation may need to be
performed on offsets and other inputs that we read from the device.

Otherwise:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 11/16] vfio-user: get and set IRQs
  2021-08-16 16:42 ` [PATCH RFC v2 11/16] vfio-user: get and set IRQs Elena Ufimtseva
@ 2021-09-07 15:14   ` Stefan Hajnoczi
  2021-09-09  5:50     ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-07 15:14 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 7580 bytes --]

On Mon, Aug 16, 2021 at 09:42:44AM -0700, Elena Ufimtseva wrote:
> From: John Johnson <john.g.johnson@oracle.com>
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/vfio/user-protocol.h |  25 ++++++++++
>  hw/vfio/user.h          |   2 +
>  hw/vfio/common.c        |  26 ++++++++--
>  hw/vfio/pci.c           |  31 ++++++++++--
>  hw/vfio/user.c          | 106 ++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 181 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> index 56904cf872..5614efa0a4 100644
> --- a/hw/vfio/user-protocol.h
> +++ b/hw/vfio/user-protocol.h
> @@ -109,6 +109,31 @@ typedef struct {
>      uint64_t offset;
>  } VFIOUserRegionInfo;
>  
> +/*
> + * VFIO_USER_DEVICE_GET_IRQ_INFO
> + * imported from struct vfio_irq_info
> + */
> +typedef struct {
> +    VFIOUserHdr hdr;
> +    uint32_t argsz;
> +    uint32_t flags;
> +    uint32_t index;
> +    uint32_t count;
> +} VFIOUserIRQInfo;
> +
> +/*
> + * VFIO_USER_DEVICE_SET_IRQS
> + * imported from struct vfio_irq_set
> + */
> +typedef struct {
> +    VFIOUserHdr hdr;
> +    uint32_t argsz;
> +    uint32_t flags;
> +    uint32_t index;
> +    uint32_t start;
> +    uint32_t count;
> +} VFIOUserIRQSet;
> +
>  /*
>   * VFIO_USER_REGION_READ
>   * VFIO_USER_REGION_WRITE
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> index 02f832a173..248ad80943 100644
> --- a/hw/vfio/user.h
> +++ b/hw/vfio/user.h
> @@ -74,6 +74,8 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
>  int vfio_user_get_info(VFIODevice *vbasedev);
>  int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
>                                struct vfio_region_info *info, VFIOUserFDs *fds);
> +int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info);
> +int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
>  int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
>                            uint32_t count, void *data);
>  int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a8b1ea9358..9fe3e05dc6 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -71,7 +71,11 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
>          .count = 0,
>      };
>  
> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +    if (vbasedev->proxy != NULL) {
> +        vfio_user_set_irqs(vbasedev, &irq_set);
> +    } else {
> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +    }
>  }
>  
>  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
> @@ -84,7 +88,11 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
>          .count = 1,
>      };
>  
> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +    if (vbasedev->proxy != NULL) {
> +        vfio_user_set_irqs(vbasedev, &irq_set);
> +    } else {
> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +    }
>  }
>  
>  void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
> @@ -97,7 +105,11 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
>          .count = 1,
>      };
>  
> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +    if (vbasedev->proxy != NULL) {
> +        vfio_user_set_irqs(vbasedev, &irq_set);
> +    } else {
> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +    }
>  }
>  
>  static inline const char *action_to_str(int action)
> @@ -178,8 +190,12 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
>      pfd = (int32_t *)&irq_set->data;
>      *pfd = fd;
>  
> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
> -        ret = -errno;
> +    if (vbasedev->proxy != NULL) {
> +        ret = vfio_user_set_irqs(vbasedev, irq_set);
> +    } else {
> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
> +            ret = -errno;
> +        }
>      }
>      g_free(irq_set);
>  
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index ea0df8be65..282de6a30b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -403,7 +403,11 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>          fds[i] = fd;
>      }
>  
> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +    if (vdev->vbasedev.proxy != NULL) {
> +        ret = vfio_user_set_irqs(&vdev->vbasedev, irq_set);
> +    } else {
> +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +    }
>  
>      g_free(irq_set);
>  
> @@ -2675,7 +2679,13 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
>  
>      irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
>  
> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> +    if (vbasedev->proxy != NULL) {
> +        ret = vfio_user_get_irq_info(vbasedev, &irq_info);
> +    } else {
> +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> +    }
> +
> +
>      if (ret) {
>          /* This can fail for an old kernel or legacy PCI dev */
>          trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
> @@ -2794,8 +2804,16 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>          return;
>      }
>  
> -    if (ioctl(vdev->vbasedev.fd,
> -              VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 || irq_info.count < 1) {
> +    if (vdev->vbasedev.proxy != NULL) {
> +        if (vfio_user_get_irq_info(&vdev->vbasedev, &irq_info) < 0) {
> +            return;
> +        }
> +    } else {
> +        if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0) {
> +            return;
> +        }
> +    }
> +    if (irq_info.count < 1) {
>          return;
>      }
>  
> @@ -3557,6 +3575,11 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> +    vfio_register_err_notifier(vdev);
> +    vfio_register_req_notifier(vdev);
> +
> +    return;
> +
>  out_deregister:
>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>      kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> index 83235b2411..b68ca1279d 100644
> --- a/hw/vfio/user.c
> +++ b/hw/vfio/user.c
> @@ -768,6 +768,112 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
>      return 0;
>  }
>  
> +int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info)
> +{
> +    VFIOUserIRQInfo msg;
> +
> +    memset(&msg, 0, sizeof(msg));
> +    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_IRQ_INFO,
> +                          sizeof(msg), 0);
> +    msg.argsz = info->argsz;
> +    msg.index = info->index;
> +
> +    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
> +    if (msg.hdr.flags & VFIO_USER_ERROR) {
> +        return -msg.hdr.error_reply;
> +    }
> +
> +    memcpy(info, &msg.argsz, sizeof(*info));

Should this be info.count = msg.count instead? Not sure why argsz is
used here.

Also, I just noticed the lack of endianness conversion in this patch
series. The spec says values are little-endian but these patches mostly
use host-endian. Did I miss something?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-08-16 16:42 ` [PATCH RFC v2 09/16] vfio-user: region read/write Elena Ufimtseva
  2021-09-07 14:41   ` Stefan Hajnoczi
@ 2021-09-07 17:24   ` John Levon
  2021-09-09  6:00     ` John Johnson
  1 sibling, 1 reply; 108+ messages in thread
From: John Levon @ 2021-09-07 17:24 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos

On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:

> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> +                           uint64_t offset, uint32_t count, void *data)
> +{
> +    g_autofree VFIOUserRegionRW *msgp = NULL;
> +    int size = sizeof(*msgp) + count;
> +
> +    msgp = g_malloc0(size);
> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
> +                          VFIO_USER_NO_REPLY);

Mirroring https://github.com/oracle/qemu/issues/10 here for visibility:

Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
meaning essentially all writes are posted. But that shouldn't be the case, for
example for PCI config space, where it's expected that writes will wait for an
ack before the VCPU continues.

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 12/16] vfio-user: proxy container connect/disconnect
  2021-08-16 16:42 ` [PATCH RFC v2 12/16] vfio-user: proxy container connect/disconnect Elena Ufimtseva
@ 2021-09-08  8:30   ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08  8:30 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 7434 bytes --]

On Mon, Aug 16, 2021 at 09:42:45AM -0700, Elena Ufimtseva wrote:
> From: John Johnson <john.g.johnson@oracle.com>
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/vfio/vfio-common.h |  3 ++
>  hw/vfio/common.c              | 84 +++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 22 +++++++++
>  3 files changed, 109 insertions(+)

Alex: I'm not familiar enough with hw/vfio/ to review this in depth. You
might have suggestions on how to unify the vfio-user and vfio kernel
concepts of groups and containers.

> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bdd25a546c..688660c28d 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -91,6 +91,7 @@ typedef struct VFIOContainer {
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
>      unsigned int dma_max_mappings;
> +    VFIOProxy *proxy;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> @@ -217,6 +218,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as);
> +void vfio_disconnect_proxy(VFIOGroup *group);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9fe3e05dc6..57b9e111e6 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -2249,6 +2249,55 @@ put_space_exit:
>      return ret;
>  }
>  
> +void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +
> +    if (QLIST_EMPTY(&vfio_group_list)) {
> +        qemu_register_reset(vfio_reset_handler, NULL);
> +    }
> +
> +    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
> +
> +    /*
> +     * try to mirror vfio_connect_container()
> +     * as much as possible
> +     */
> +
> +    space = vfio_get_address_space(as);
> +
> +    container = g_malloc0(sizeof(*container));
> +    container->space = space;
> +    container->fd = -1;
> +    QLIST_INIT(&container->giommu_list);
> +    QLIST_INIT(&container->hostwin_list);
> +    container->proxy = proxy;
> +
> +    /*
> +     * The proxy uses a SW IOMMU in lieu of the HW one
> +     * used in the ioctl() version.  Use TYPE1 with the
> +     * target's page size for maximum capatibility
> +     */
> +    container->iommu_type = VFIO_TYPE1_IOMMU;
> +    vfio_host_win_add(container, 0, (hwaddr)-1, TARGET_PAGE_SIZE);
> +    container->pgsizes = TARGET_PAGE_SIZE;
> +
> +    container->dirty_pages_supported = true;
> +    container->max_dirty_bitmap_size = VFIO_USER_DEF_MAX_XFER;
> +    container->dirty_pgsizes = TARGET_PAGE_SIZE;
> +
> +    QLIST_INIT(&container->group_list);
> +    QLIST_INSERT_HEAD(&space->containers, container, next);
> +
> +    group->container = container;
> +    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +    container->initialized = true;
> +}
> +
>  static void vfio_disconnect_container(VFIOGroup *group)
>  {
>      VFIOContainer *container = group->container;
> @@ -2291,6 +2340,41 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      }
>  }
>  
> +void vfio_disconnect_proxy(VFIOGroup *group)
> +{
> +    VFIOContainer *container = group->container;
> +    VFIOAddressSpace *space = container->space;
> +    VFIOGuestIOMMU *giommu, *tmp;
> +
> +    /*
> +     * try to mirror vfio_disconnect_container()
> +     * as much as possible, knowing each device
> +     * is in one group and one container
> +     */
> +
> +    QLIST_REMOVE(group, container_next);
> +    group->container = NULL;
> +
> +    /*
> +     * Explicitly release the listener first before unset container,
> +     * since unset may destroy the backend container if it's the last
> +     * group.
> +     */
> +    memory_listener_unregister(&container->listener);
> +
> +    QLIST_REMOVE(container, next);
> +
> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +        memory_region_unregister_iommu_notifier(
> +            MEMORY_REGION(giommu->iommu), &giommu->n);
> +        QLIST_REMOVE(giommu, giommu_next);
> +        g_free(giommu);
> +    }
> +
> +    g_free(container);
> +    vfio_put_address_space(space);
> +}
> +
>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  {
>      VFIOGroup *group;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 282de6a30b..2c9fcb2fa9 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3442,6 +3442,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>      VFIODevice *vbasedev = &vdev->vbasedev;
>      SocketAddress addr;
>      VFIOProxy *proxy;
> +    VFIOGroup *group = NULL;
>      int ret;
>      Error *err = NULL;
>  
> @@ -3484,6 +3485,19 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>      vbasedev->no_mmap = false;
>      vbasedev->ops = &vfio_user_pci_ops;
>  
> +    /*
> +     * each device gets its own group and container
> +     * make them unrelated to any host IOMMU groupings
> +     */
> +    group = g_malloc0(sizeof(*group));
> +    group->fd = -1;
> +    group->groupid = -1;
> +    QLIST_INIT(&group->device_list);
> +    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
> +    vbasedev->group = group;
> +
> +    vfio_connect_proxy(proxy, group, pci_device_iommu_address_space(pdev));
> +
>      ret = vfio_user_get_info(&vdev->vbasedev);
>      if (ret) {
>          error_setg_errno(errp, -ret, "get info failure");
> @@ -3587,6 +3601,9 @@ out_teardown:
>      vfio_teardown_msi(vdev);
>      vfio_bars_exit(vdev);
>  error:
> +    if (group != NULL) {
> +        vfio_disconnect_proxy(group);
> +    }
>      vfio_user_disconnect(proxy);
>      error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>  }
> @@ -3595,6 +3612,11 @@ static void vfio_user_instance_finalize(Object *obj)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
>      VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIOGroup *group = vbasedev->group;
> +
> +    vfio_disconnect_proxy(group);
> +    g_free(group);
> +    vbasedev->group = NULL;

Can vfio_put_group() be used instead? I'm worried that the cleanup code
will be duplicated or become inconsistent if it's not shared.

Also, vfio_instance_finalize() calls vfio_put_group() after
vfio_put_device(). Does this code intentionally take advantage of the if
(!vbasedev->group) early return in vfio_put_base_device()? This is
non-obvious. I recommend unifying the device and group cleanup instead
of special-casing it here (this is fragile!).

>  
>      vfio_put_device(vdev);
>  
> -- 
> 2.25.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 13/16] vfio-user: dma map/unmap operations
  2021-08-16 16:42 ` [PATCH RFC v2 13/16] vfio-user: dma map/unmap operations Elena Ufimtseva
@ 2021-09-08  9:16   ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08  9:16 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 2840 bytes --]

On Mon, Aug 16, 2021 at 09:42:46AM -0700, Elena Ufimtseva wrote:
> +void vfio_user_drain_reqs(VFIOProxy *proxy)
> +{
> +    VFIOUserReply *reply;
> +    bool iolock = 0;
> +
> +    /*
> +     * Any DMA map/unmap requests sent in the middle
> +     * of a memory region transaction were sent async.
> +     * Wait for them here.
> +     */
> +    QEMU_LOCK_GUARD(&proxy->lock);
> +    if (proxy->last_nowait != NULL) {
> +        iolock = qemu_mutex_iothread_locked();
> +        if (iolock) {
> +            qemu_mutex_unlock_iothread();
> +        }
> +
> +        reply = proxy->last_nowait;
> +        reply->nowait = 0;
> +        while (reply->complete == 0) {
> +            if (!qemu_cond_timedwait(&reply->cv, &proxy->lock, wait_time)) {
> +                error_printf("vfio_drain_reqs - timed out\n");
> +                break;
> +            }
> +        }
> +
> +        if (reply->msg->flags & VFIO_USER_ERROR) {
> +            error_printf("vfio_user_rcv error reply on async request ");
> +            error_printf("command %x error %s\n", reply->msg->command,
> +                         strerror(reply->msg->error_reply));
> +        }
> +        proxy->last_nowait = NULL;
> +        g_free(reply->msg);
> +        QTAILQ_INSERT_HEAD(&proxy->free, reply, next);
> +    }
> +
> +    if (iolock) {
> +        qemu_mutex_lock_iothread();
> +    }

Not sure this lock ordering is correct. Above we acquire proxy->lock
while holding the BQL and here we acquire the BQL while holding
proxy->lock. If another thread (e.g. a vCPU thread) does something
similar this is the ABBA lock ordering problem.

The more obviously correct way to write this is:

  WITH_QEMU_LOCK_GUARD(&proxy->lock) {
      ...
  }

  if (iolock) {
      qemu_mutex_lock_iothread();
  }

> +}
> +
>  static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
>                                    uint32_t size, uint32_t flags)
>  {
> @@ -715,6 +756,89 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
>      return 0;
>  }
>  
> +int vfio_user_dma_map(VFIOProxy *proxy, struct vfio_iommu_type1_dma_map *map,
> +                      VFIOUserFDs *fds, bool will_commit)
> +{
> +    VFIOUserDMAMap *msgp = g_malloc(sizeof(*msgp));

Is this zero-initialized anywhere to guarantee that QEMU memory isn't
exposed to the VFIO device emulation program?

> +    int ret, flags;
> +
> +    /* commit will wait, so send async without dropping BQL */
> +    flags = will_commit ? (NOIOLOCK | NOWAIT) : 0;

Why is this distinction between will_commit and !will_commit necessary?
I get a sense that the network communications code drops the BQL and
that's undesirable here for some reason. I wonder why the code doesn't
take the NOWAIT code path all the time?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 14/16] vfio-user: dma read/write operations
  2021-08-16 16:42 ` [PATCH RFC v2 14/16] vfio-user: dma read/write operations Elena Ufimtseva
@ 2021-09-08  9:51   ` Stefan Hajnoczi
  2021-09-08 11:03     ` John Levon
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08  9:51 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 2921 bytes --]

On Mon, Aug 16, 2021 at 09:42:47AM -0700, Elena Ufimtseva wrote:
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 2c9fcb2fa9..29a874c066 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3406,11 +3406,72 @@ type_init(register_vfio_pci_dev_type)
>   * vfio-user routines.
>   */
>  
> -static int vfio_user_pci_process_req(void *opaque, char *buf, VFIOUserFDs *fds)
> +static int vfio_user_dma_read(VFIOPCIDevice *vdev, VFIOUserDMARW *msg)
>  {
> +    PCIDevice *pdev = &vdev->pdev;
> +    char *buf;
> +    int size = msg->count + sizeof(VFIOUserDMARW);

The caller has only checked that hdr->size is large enough for
VFIOUserHdr, not VFIOUserDMARW. We must not access VFIOUserDMARW fields
until this has been checked.

Size should be size_t to avoid signedness issues.

Even then, this can overflow on 32-bit hosts so I suggest moving this
arithmetic expression below the msg->count > vfio_user_max_xfer() check.
That way it's clear that overflow cannot happen.

> +
> +    if (msg->hdr.flags & VFIO_USER_NO_REPLY) {
> +        return -EINVAL;
> +    }
> +    if (msg->count > vfio_user_max_xfer()) {
> +        return -E2BIG;
> +    }

Does vfio-user allow the request to be smaller than the reply? In other
words, is it okay that we're not checking msg->count against hdr->size?

> +
> +    buf = g_malloc0(size);
> +    memcpy(buf, msg, sizeof(*msg));
> +
> +    pci_dma_read(pdev, msg->offset, buf + sizeof(*msg), msg->count);

The vfio-user spec doesn't go into errors but pci_dma_read() can return
errors. Hmm...

> +
> +    vfio_user_send_reply(vdev->vbasedev.proxy, buf, size);
> +    g_free(buf);
>      return 0;
>  }
>  
> +static int vfio_user_dma_write(VFIOPCIDevice *vdev,
> +                               VFIOUserDMARW *msg)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    char *buf = (char *)msg + sizeof(*msg);

Or:

  char *buf = msg->data;

> +
> +    /* make sure transfer count isn't larger than the message data */
> +    if (msg->count > msg->hdr.size - sizeof(*msg)) {
> +        return -E2BIG;
> +    }

msg->count cannot be accessed until we have checked that msg->hdr.size
is large enough for VFIOUserDMARW. Adding the check also eliminates the
underflow in the subtraction if msg->hdr.size was smaller than
sizeof(VFIOUserDMARW).

> +
> +    pci_dma_write(pdev, msg->offset, buf, msg->count);
> +
> +    if ((msg->hdr.flags & VFIO_USER_NO_REPLY) == 0) {
> +        vfio_user_send_reply(vdev->vbasedev.proxy, (char *)msg,
> +                             sizeof(msg->hdr));
> +    }
> +    return 0;
> +}
> +
> +static int vfio_user_pci_process_req(void *opaque, char *buf, VFIOUserFDs *fds)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    VFIOUserHdr *hdr = (VFIOUserHdr *)buf;
> +    int ret;
> +
> +    if (fds->recv_fds != 0) {
> +        return -EINVAL;

Where are the fds closed?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 15/16] vfio-user: pci reset
  2021-08-16 16:42 ` [PATCH RFC v2 15/16] vfio-user: pci reset Elena Ufimtseva
@ 2021-09-08  9:56   ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08  9:56 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 4444 bytes --]

On Mon, Aug 16, 2021 at 09:42:48AM -0700, Elena Ufimtseva wrote:
> From: John Johnson <john.g.johnson@oracle.com>
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/vfio/user.h |  1 +
>  hw/vfio/pci.c  | 29 ++++++++++++++++++++++++++---
>  hw/vfio/user.c | 12 ++++++++++++
>  3 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> index 32e8b70d28..5d4d0a43ba 100644
> --- a/hw/vfio/user.h
> +++ b/hw/vfio/user.h
> @@ -86,6 +86,7 @@ int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
>                            uint32_t count, void *data);
>  int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
>                             uint64_t offset, uint32_t count, void *data);
> +void vfio_user_reset(VFIODevice *vbasedev);
>  void vfio_user_drain_reqs(VFIOProxy *proxy);
>  
>  #endif /* VFIO_USER_H */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 29a874c066..4b933ed10f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2229,8 +2229,9 @@ static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
>  
>  static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
>  {
> +    VFIODevice *vbasedev = &vdev->vbasedev;
>      Error *err = NULL;
> -    int nr;
> +    int ret, nr;
>  
>      vfio_intx_enable(vdev, &err);
>      if (err) {
> @@ -2238,11 +2239,18 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
>      }
>  
>      for (nr = 0; nr < PCI_NUM_REGIONS - 1; ++nr) {
> -        off_t addr = vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr);
> +        off_t addr = PCI_BASE_ADDRESS_0 + (4 * nr);
>          uint32_t val = 0;
>          uint32_t len = sizeof(val);
>  
> -        if (pwrite(vdev->vbasedev.fd, &val, len, addr) != len) {
> +        if (vbasedev->proxy != NULL) {
> +            ret = vfio_user_region_write(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX,
> +                                         addr, len, &val);
> +        } else {
> +            ret = pwrite(vdev->vbasedev.fd, &val, len,
> +                         vdev->config_offset + addr);
> +        }
> +        if (ret != len) {
>              error_report("%s(%s) reset bar %d failed: %m", __func__,
>                           vdev->vbasedev.name, nr);

The %m format string assumes vfio_user_region_write() sets errno. I
don't think it does. We're relying on vfio_user_region_write() never
failing here, which is true at the moment but not nice.

>          }
> @@ -3684,6 +3692,20 @@ static void vfio_user_instance_finalize(Object *obj)
>      vfio_user_disconnect(vbasedev->proxy);
>  }
>  
> +static void vfio_user_pci_reset(DeviceState *dev)
> +{
> +    VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +
> +    vfio_pci_pre_reset(vdev);
> +
> +    if (vbasedev->reset_works) {
> +        vfio_user_reset(vbasedev);
> +    }
> +
> +    vfio_pci_post_reset(vdev);
> +}
> +
>  static Property vfio_user_pci_dev_properties[] = {
>      DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
>      DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
> @@ -3695,6 +3717,7 @@ static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
>      DeviceClass *dc = DEVICE_CLASS(klass);
>      PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
>  
> +    dc->reset = vfio_user_pci_reset;
>      device_class_set_props(dc, vfio_user_pci_dev_properties);
>      dc->desc = "VFIO over socket PCI device assignment";
>      pdc->realize = vfio_user_pci_realize;
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> index fcc041959c..7de2125346 100644
> --- a/hw/vfio/user.c
> +++ b/hw/vfio/user.c
> @@ -1045,3 +1045,15 @@ int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
>  
>      return count;
>  }
> +
> +void vfio_user_reset(VFIODevice *vbasedev)
> +{
> +    VFIOUserHdr msg;

Maybe add "= {}" to ensure it's zero-initialized?

> +
> +    vfio_user_request_msg(&msg, VFIO_USER_DEVICE_RESET, sizeof(msg), 0);
> +
> +    vfio_user_send_recv(vbasedev->proxy, &msg, NULL, 0, 0);
> +    if (msg.flags & VFIO_USER_ERROR) {
> +        error_printf("reset reply error %d\n", msg.error_reply);
> +    }
> +}
> -- 
> 2.25.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 16/16] vfio-user: migration support
  2021-08-16 16:42 ` [PATCH RFC v2 16/16] vfio-user: migration support Elena Ufimtseva
@ 2021-09-08 10:04   ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 10:04 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: john.g.johnson, jag.raman, swapnil.ingle, john.levon, qemu-devel,
	alex.williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 2229 bytes --]

On Mon, Aug 16, 2021 at 09:42:49AM -0700, Elena Ufimtseva wrote:
> @@ -1356,7 +1365,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>          goto err_out;
>      }
>  
> -    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +    if (container->proxy != NULL) {
> +        ret = vfio_user_dirty_bitmap(container->proxy, dbitmap, range);
> +    } else {
> +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +    }
>      if (ret) {
>          error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
>                  " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,

This error_report() relies on errno. vfio_user_region_write() doesn't
set errno.

> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 82f654afb6..89926a3b01 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -27,6 +27,7 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "hw/hw.h"
> +#include "user.h"
>  
>  /*
>   * Flags to be used as unique delimiters for VFIO devices in the migration
> @@ -49,10 +50,18 @@ static int64_t bytes_transferred;
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> +    VFIORegion *region = &vbasedev->migration->region;
>      int ret;
>  
> -    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> -                    pread(vbasedev->fd, val, count, off);
> +    if (vbasedev->proxy != NULL) {
> +        ret = iswrite ?
> +            vfio_user_region_write(vbasedev, region->nr, off, count, val) :
> +            vfio_user_region_read(vbasedev, region->nr, off, count, val);
> +    } else {
> +        off += region->fd_offset;
> +        ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> +                        pread(vbasedev->fd, val, count, off);
> +    }
>      if (ret < count) {
>          error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
>                       HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,

Another errno user. I haven't exhaustively audited all the code for
these issues. Please take a look.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 00/11] vfio-user server in QEMU
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (10 preceding siblings ...)
  2021-08-27 17:53   ` [PATCH RFC server v2 11/11] vfio-user: acceptance test Jagannathan Raman
@ 2021-09-08 10:08   ` Stefan Hajnoczi
  2021-09-08 12:06     ` Jag Raman
  2021-09-09  8:17   ` Stefan Hajnoczi
  12 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 10:08 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 512 bytes --]

On Fri, Aug 27, 2021 at 01:53:19PM -0400, Jagannathan Raman wrote:
> Hi,
> 
> This series depends on the following series from
> Elena Ufimtseva <elena.ufimtseva@oracle.com>:
> [PATCH RFC v2 00/16] vfio-user implementation

Please send future revisions as separate email threads. Tools have
trouble separating your series from the one you replied to.

You can use "Based-on" to let CI know that Elena's series needs to be
applied first:

Based-on: <cover.1629131628.git.elena.ufimtseva@oracle.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 14/16] vfio-user: dma read/write operations
  2021-09-08  9:51   ` Stefan Hajnoczi
@ 2021-09-08 11:03     ` John Levon
  0 siblings, 0 replies; 108+ messages in thread
From: John Levon @ 2021-09-08 11:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, john.g.johnson, jag.raman, Swapnil Ingle,
	qemu-devel, alex.williamson, Thanos Makatos

On Wed, Sep 08, 2021 at 10:51:11AM +0100, Stefan Hajnoczi wrote:

> > +
> > +    buf = g_malloc0(size);
> > +    memcpy(buf, msg, sizeof(*msg));
> > +
> > +    pci_dma_read(pdev, msg->offset, buf + sizeof(*msg), msg->count);
> 
> The vfio-user spec doesn't go into errors but pci_dma_read() can return
> errors. Hmm...

It's certainly under-specified in the spec, but in terms of the library, we do
return EINVAL if we decide something invalid happened...

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 00/11] vfio-user server in QEMU
  2021-09-08 10:08   ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Stefan Hajnoczi
@ 2021-09-08 12:06     ` Jag Raman
  0 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-08 12:06 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 8, 2021, at 6:08 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Aug 27, 2021 at 01:53:19PM -0400, Jagannathan Raman wrote:
>> Hi,
>> 
>> This series depends on the following series from
>> Elena Ufimtseva <elena.ufimtseva@oracle.com>:
>> [PATCH RFC v2 00/16] vfio-user implementation
> 
> Please send future revisions as separate email threads. Tools have
> trouble separating your series from the one you replied to.
> 
> You can use "Based-on" to let CI know that Elena's series needs to be
> applied first:
> 
> Based-on: <cover.1629131628.git.elena.ufimtseva@oracle.com>

Thank you for letting us know, Stefan! Will do going forward.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
  2021-08-27 18:05     ` Jag Raman
@ 2021-09-08 12:25     ` Stefan Hajnoczi
  2021-09-10 15:21       ` Philippe Mathieu-Daudé
  2021-09-10 15:20     ` Philippe Mathieu-Daudé
  2 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 12:25 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 2816 bytes --]

On Fri, Aug 27, 2021 at 01:53:20PM -0400, Jagannathan Raman wrote:
> diff --git a/meson.build b/meson.build
> index bf63784..2b2d5c2 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1898,6 +1898,34 @@ if get_option('cfi') and slirp_opt == 'system'
>           + ' Please configure with --enable-slirp=git')
>  endif
>  
> +vfiouser = not_found
> +if have_system and multiprocess_allowed
> +  have_internal = fs.exists(meson.current_source_dir() / 'subprojects/libvfio-user/Makefile')
> +
> +  if not have_internal
> +    error('libvfio-user source not found - please pull git submodule')
> +  endif
> +
> +  json_c = dependency('json-c', required: false)
> +    if not json_c.found()

Indentation is off.

> +      json_c = dependency('libjson-c')
> +  endif
> +
> +  cmake = import('cmake')
> +
> +  vfiouser_subproj = cmake.subproject('libvfio-user')
> +
> +  vfiouser_sl = vfiouser_subproj.dependency('vfio-user-static')
> +
> +  # Although cmake links the json-c library with vfio-user-static
> +  # target, that info is not available to meson via cmake.subproject.
> +  # As such, we have to separately declare the json-c dependency here.
> +  # This appears to be a current limitation of using cmake inside meson.
> +  # libvfio-user is planning a switch to meson in the future, which
> +  # would address this item automatically.
> +  vfiouser = declare_dependency(dependencies: [vfiouser_sl, json_c])
> +endif
> +
>  fdt = not_found
>  fdt_opt = get_option('fdt')
>  if have_system
> diff --git a/.gitmodules b/.gitmodules
> index 08b1b48..cfeea7c 100644
> --- a/.gitmodules
> +++ b/.gitmodules
> @@ -64,3 +64,6 @@
>  [submodule "roms/vbootrom"]
>  	path = roms/vbootrom
>  	url = https://gitlab.com/qemu-project/vbootrom.git
> +[submodule "subprojects/libvfio-user"]
> +	path = subprojects/libvfio-user
> +	url = https://github.com/nutanix/libvfio-user.git

Once this is merged I'll set up a
gitlab.com/qemu-project/libvfio-user.git mirror. This ensures that no
matter what happens with upstream libvfio-user.git, the source code that
QEMU builds against will remain archived/available.

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 4039d3c..0c5a18e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3361,6 +3361,13 @@ F: semihosting/
>  F: include/semihosting/
>  F: tests/tcg/multiarch/arm-compat-semi/
>  
> +libvfio-user Library
> +M: Thanos Makatos <thanos.makatos@nutanix.com>
> +M: John Levon <john.levon@nutanix.com>
> +T: https://github.com/nutanix/libvfio-user.git
> +S: Maintained
> +F: subprojects/libvfio-user/*

A MAINTAINERS entry isn't necessary for git submodules. This could
become outdated. People should look at the upstream project instead for
information on maintainership and how to contribute.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 02/11] vfio-user: define vfio-user object
  2021-08-27 17:53   ` [PATCH RFC server v2 02/11] vfio-user: define vfio-user object Jagannathan Raman
@ 2021-09-08 12:37     ` Stefan Hajnoczi
  2021-09-10 14:04       ` Jag Raman
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 12:37 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 3863 bytes --]

On Fri, Aug 27, 2021 at 01:53:21PM -0400, Jagannathan Raman wrote:
> Define vfio-user object which is remote process server for QEMU. Setup
> object initialization functions and properties necessary to instantiate
> the object
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  qapi/qom.json             |  20 ++++++-
>  hw/remote/vfio-user-obj.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++
>  MAINTAINERS               |   1 +
>  hw/remote/meson.build     |   1 +
>  hw/remote/trace-events    |   3 +
>  5 files changed, 168 insertions(+), 2 deletions(-)
>  create mode 100644 hw/remote/vfio-user-obj.c
> 
> diff --git a/qapi/qom.json b/qapi/qom.json
> index a25616b..3e941ee 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -689,6 +689,20 @@
>    'data': { 'fd': 'str', 'devid': 'str' } }
>  
>  ##
> +# @VfioUserProperties:
> +#
> +# Properties for vfio-user objects.
> +#
> +# @socket: path to be used as socket by the libvfiouser library
> +#
> +# @devid: the id of the device to be associated with the file descriptor
> +#
> +# Since: 6.0
> +##
> +{ 'struct': 'VfioUserProperties',
> +  'data': { 'socket': 'str', 'devid': 'str' } }

Please use 'SocketAddress' for socket instead of 'str'. That way file
descriptor passing is easy to support and additional socket address
families can be supported in the future.

> +
> +##
>  # @RngProperties:
>  #
>  # Properties for objects of classes derived from rng.
> @@ -812,7 +826,8 @@
>      'tls-creds-psk',
>      'tls-creds-x509',
>      'tls-cipher-suites',
> -    'x-remote-object'
> +    'x-remote-object',
> +    'vfio-user'
>    ] }
>  
>  ##
> @@ -868,7 +883,8 @@
>        'tls-creds-psk':              'TlsCredsPskProperties',
>        'tls-creds-x509':             'TlsCredsX509Properties',
>        'tls-cipher-suites':          'TlsCredsProperties',
> -      'x-remote-object':            'RemoteObjectProperties'
> +      'x-remote-object':            'RemoteObjectProperties',
> +      'vfio-user':                  'VfioUserProperties'

"vfio-user" doesn't communicate whether this is a client or server. Is
"vfio-user-server" clearer?

>    } }
>  
>  ##
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> new file mode 100644
> index 0000000..4a1e297
> --- /dev/null
> +++ b/hw/remote/vfio-user-obj.c
> @@ -0,0 +1,145 @@
> +/**
> + * QEMU vfio-user server object
> + *
> + * Copyright © 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
> + *
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +/**
> + * Usage: add options:
> + *     -machine x-remote
> + *     -device <PCI-device>,id=<pci-dev-id>
> + *     -object vfio-user,id=<id>,socket=<socket-path>,devid=<pci-dev-id>

I suggest renaming devid= to device= or pci-device= (similar to drive=
and netdev=) for consistency and to avoid confusion with PCI Device IDs.

> diff --git a/hw/remote/meson.build b/hw/remote/meson.build
> index fb35fb8..cd44dfc 100644
> --- a/hw/remote/meson.build
> +++ b/hw/remote/meson.build
> @@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
>  remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
>  remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
>  remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
> +remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('vfio-user-obj.c'))

If you use CONFIG_VFIO_USER_SERVER then it's easier to separate mpqemu
from vfio-user. Sharing CONFIG_MULTIPROCESS could become messy later.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context
  2021-08-27 17:53   ` [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2021-09-08 12:40     ` Stefan Hajnoczi
  2021-09-10 14:58       ` Jag Raman
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 12:40 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 2938 bytes --]

On Fri, Aug 27, 2021 at 01:53:22PM -0400, Jagannathan Raman wrote:
> create a context with the vfio-user library to run a PCI device
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 4a1e297..99d3dd1 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -27,11 +27,17 @@
>  #include "qemu/osdep.h"
>  #include "qemu-common.h"
>  
> +#include <errno.h>

qemu/osdep.h already includes <errno.h>

> +
>  #include "qom/object.h"
>  #include "qom/object_interfaces.h"
>  #include "qemu/error-report.h"
>  #include "trace.h"
>  #include "sysemu/runstate.h"
> +#include "qemu/notify.h"
> +#include "qapi/error.h"
> +#include "sysemu/sysemu.h"
> +#include "libvfio-user.h"
>  
>  #define TYPE_VFU_OBJECT "vfio-user"
>  OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
> @@ -51,6 +57,10 @@ struct VfuObject {
>  
>      char *socket;
>      char *devid;
> +
> +    Notifier machine_done;
> +
> +    vfu_ctx_t *vfu_ctx;
>  };
>  
>  static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
> @@ -75,9 +85,23 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
>      trace_vfu_prop("devid", str);
>  }
>  
> +static void vfu_object_machine_done(Notifier *notifier, void *data)

Please document the reason for using a machine init done notifier.

> +{
> +    VfuObject *o = container_of(notifier, VfuObject, machine_done);
> +
> +    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
> +                                o, VFU_DEV_TYPE_PCI);
> +    if (o->vfu_ctx == NULL) {
> +        error_setg(&error_abort, "vfu: Failed to create context - %s",
> +                   strerror(errno));
> +        return;
> +    }
> +}
> +
>  static void vfu_object_init(Object *obj)
>  {
>      VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
> +    VfuObject *o = VFU_OBJECT(obj);
>  
>      if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
>          error_report("vfu: %s only compatible with %s machine",
> @@ -92,6 +116,9 @@ static void vfu_object_init(Object *obj)
>      }
>  
>      k->nr_devs++;
> +
> +    o->machine_done.notify = vfu_object_machine_done;
> +    qemu_add_machine_init_done_notifier(&o->machine_done);
>  }
>  
>  static void vfu_object_finalize(Object *obj)
> @@ -101,6 +128,8 @@ static void vfu_object_finalize(Object *obj)
>  
>      k->nr_devs--;
>  
> +    vfu_destroy_ctx(o->vfu_ctx);

Will this function ever be called before vfu_object_machine_done() is
called? In that case vfu_ctx isn't initialized.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 04/11] vfio-user: find and init PCI device
  2021-08-27 17:53   ` [PATCH RFC server v2 04/11] vfio-user: find and init PCI device Jagannathan Raman
@ 2021-09-08 12:43     ` Stefan Hajnoczi
  2021-09-10 15:02       ` Jag Raman
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 12:43 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 801 bytes --]

On Fri, Aug 27, 2021 at 01:53:23PM -0400, Jagannathan Raman wrote:
> @@ -96,6 +102,28 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>                     strerror(errno));
>          return;
>      }
> +
> +    dev = qdev_find_recursive(sysbus_get_default(), o->devid);
> +    if (dev == NULL) {
> +        error_setg(&error_abort, "vfu: Device %s not found", o->devid);
> +        return;
> +    }
> +
> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> +        error_setg(&error_abort, "vfu: %s not a PCI devices", o->devid);
> +        return;
> +    }
> +
> +    o->pci_dev = PCI_DEVICE(dev);
> +
> +    ret = vfu_pci_init(o->vfu_ctx, VFU_PCI_TYPE_CONVENTIONAL,
> +                       PCI_HEADER_TYPE_NORMAL, 0);

What is needed to support PCI Express?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  2021-08-27 17:53   ` [PATCH RFC server v2 05/11] vfio-user: run vfio-user context Jagannathan Raman
@ 2021-09-08 12:58     ` Stefan Hajnoczi
  2021-09-08 13:37       ` John Levon
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 12:58 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 5018 bytes --]

On Fri, Aug 27, 2021 at 01:53:24PM -0400, Jagannathan Raman wrote:
> Setup a handler to run vfio-user context. The context is driven by
> messages to the file descriptor associated with it - get the fd for
> the context and hook up the handler with it
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 5ae0991..0726eb9 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -35,6 +35,7 @@
>  #include "trace.h"
>  #include "sysemu/runstate.h"
>  #include "qemu/notify.h"
> +#include "qemu/thread.h"
>  #include "qapi/error.h"
>  #include "sysemu/sysemu.h"
>  #include "libvfio-user.h"
> @@ -65,6 +66,8 @@ struct VfuObject {
>      vfu_ctx_t *vfu_ctx;
>  
>      PCIDevice *pci_dev;
> +
> +    int vfu_poll_fd;
>  };
>  
>  static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
> @@ -89,13 +92,67 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
>      trace_vfu_prop("devid", str);
>  }
>  
> +static void vfu_object_ctx_run(void *opaque)
> +{
> +    VfuObject *o = opaque;
> +    int ret = -1;
> +
> +    while (ret != 0) {
> +        ret = vfu_run_ctx(o->vfu_ctx);
> +        if (ret < 0) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == ENOTCONN) {
> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> +                o->vfu_poll_fd = -1;
> +                object_unparent(OBJECT(o));
> +                break;
> +            } else {
> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
> +                           o->devid, strerror(errno));
> +                 break;
> +            }
> +        }
> +    }
> +}
> +
> +static void *vfu_object_attach_ctx(void *opaque)
> +{
> +    VfuObject *o = opaque;
> +    int ret;
> +
> +retry_attach:
> +    ret = vfu_attach_ctx(o->vfu_ctx);
> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {

Does this loop consume 100% CPU since this is non-blocking?

Is it possible to register the fd with a QEMU AioContext instead of
spawning a separate thread?

libvfio-user has non-blocking listen_fd but conn_fd is always blocking.
This means ATTACH_NB is not useful because vfu_attach_ctx() is actually
blocking. I think this means vfu_run_ctx() is also blocking in some
places and QEMU's event loop might hang :(.

Can you make libvfio-user non-blocking in order to solve these issues?

> +        goto retry_attach;
> +    } else if (ret < 0) {
> +        error_setg(&error_abort,
> +                   "vfu: Failed to attach device %s to context - %s",
> +                   o->devid, strerror(errno));
> +        return NULL;
> +    }
> +
> +    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
> +    if (o->vfu_poll_fd < 0) {
> +        error_setg(&error_abort, "vfu: Failed to get poll fd %s", o->devid);
> +        return NULL;
> +    }
> +
> +    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run,
> +                        NULL, o);
> +
> +    return NULL;
> +}
> +
>  static void vfu_object_machine_done(Notifier *notifier, void *data)
>  {
>      VfuObject *o = container_of(notifier, VfuObject, machine_done);
>      DeviceState *dev = NULL;
> +    QemuThread thread;
>      int ret;
>  
> -    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
> +    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket,
> +                                LIBVFIO_USER_FLAG_ATTACH_NB,
>                                  o, VFU_DEV_TYPE_PCI);
>      if (o->vfu_ctx == NULL) {
>          error_setg(&error_abort, "vfu: Failed to create context - %s",
> @@ -124,6 +181,16 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>                     o->devid, strerror(errno));
>          return;
>      }
> +
> +    ret = vfu_realize_ctx(o->vfu_ctx);
> +    if (ret < 0) {
> +        error_setg(&error_abort, "vfu: Failed to realize device %s- %s",
> +                   o->devid, strerror(errno));
> +        return;
> +    }
> +
> +    qemu_thread_create(&thread, o->socket, vfu_object_attach_ctx, o,
> +                       QEMU_THREAD_DETACHED);

Is this thread leaked when the object is destroyed?

>  }
>  
>  static void vfu_object_init(Object *obj)
> @@ -147,6 +214,8 @@ static void vfu_object_init(Object *obj)
>  
>      o->machine_done.notify = vfu_object_machine_done;
>      qemu_add_machine_init_done_notifier(&o->machine_done);
> +
> +    o->vfu_poll_fd = -1;
>  }
>  
>  static void vfu_object_finalize(Object *obj)
> -- 
> 1.8.3.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  2021-09-08 12:58     ` Stefan Hajnoczi
@ 2021-09-08 13:37       ` John Levon
  2021-09-08 15:02         ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Levon @ 2021-09-08 13:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman,
	Swapnil Ingle, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, Thanos Makatos, alex.bennee

On Wed, Sep 08, 2021 at 01:58:46PM +0100, Stefan Hajnoczi wrote:

> > +static void *vfu_object_attach_ctx(void *opaque)
> > +{
> > +    VfuObject *o = opaque;
> > +    int ret;
> > +
> > +retry_attach:
> > +    ret = vfu_attach_ctx(o->vfu_ctx);
> > +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> 
> Does this loop consume 100% CPU since this is non-blocking?

Looks like it. Instead after vfu_create_ctx, there should be a vfu_get_poll_fd()
to get the listen socket, then a qemu_set_fd_handler(vfu_object_attach_ctx)
to handle the attach when the listen socket is ready, modulo the below part.

> libvfio-user has non-blocking listen_fd but conn_fd is always blocking.

It is, but in vfu_run_ctx(), we poll on it:

```
790     if (vfu_ctx->flags & LIBVFIO_USER_FLAG_ATTACH_NB) {                          
791         sock_flags = MSG_DONTWAIT | MSG_WAITALL;                                 
792     }                                                                            
793     return get_msg(hdr, sizeof(*hdr), fds, nr_fds, ts->conn_fd, sock_flags);     
```

> This means ATTACH_NB is not useful because vfu_attach_ctx() is actually
> blocking.

You're correct that vfu_attach_ctx is in fact partially blocking: after
accepting the connection, we call negotiate(), which can indeed block waiting if
the client hasn't sent anything.

> I think this means vfu_run_ctx() is also blocking in some places

Correct. There's a presumption that if a message is ready, we can read it all
without blocking, and equally that we can write to the socket without blocking.

The library docs are not at all clear on this point.

> and QEMU's event loop might hang :(
> 
> Can you make libvfio-user non-blocking in order to solve these issues?

I presume you're concerned about the security aspect: a malicious client could
withhold a write, and hence hang the device server.

Problem is the libvfio-user API is synchronous: there's no way to return
half-way through a vfu_attach_ctx() (or a vfu_run_ctx() after we read the
header) then resume.

We'd have to have a whole separate API to do that, so a separate thread seems a
better approach?

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  2021-09-08 13:37       ` John Levon
@ 2021-09-08 15:02         ` Stefan Hajnoczi
  2021-09-08 15:21           ` John Levon
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 15:02 UTC (permalink / raw)
  To: John Levon
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman,
	Swapnil Ingle, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, Thanos Makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 2888 bytes --]

On Wed, Sep 08, 2021 at 01:37:53PM +0000, John Levon wrote:
> On Wed, Sep 08, 2021 at 01:58:46PM +0100, Stefan Hajnoczi wrote:
> 
> > > +static void *vfu_object_attach_ctx(void *opaque)
> > > +{
> > > +    VfuObject *o = opaque;
> > > +    int ret;
> > > +
> > > +retry_attach:
> > > +    ret = vfu_attach_ctx(o->vfu_ctx);
> > > +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> > 
> > Does this loop consume 100% CPU since this is non-blocking?
> 
> Looks like it. Instead after vfu_create_ctx, there should be a vfu_get_poll_fd()
> to get the listen socket, then a qemu_set_fd_handler(vfu_object_attach_ctx)
> to handle the attach when the listen socket is ready, modulo the below part.
> 
> > libvfio-user has non-blocking listen_fd but conn_fd is always blocking.
> 
> It is, but in vfu_run_ctx(), we poll on it:
> 
> ```
> 790     if (vfu_ctx->flags & LIBVFIO_USER_FLAG_ATTACH_NB) {                          
> 791         sock_flags = MSG_DONTWAIT | MSG_WAITALL;                                 
> 792     }                                                                            
> 793     return get_msg(hdr, sizeof(*hdr), fds, nr_fds, ts->conn_fd, sock_flags);     
> ```

This is only used for the request header. Other I/O is blocking.

> 
> > This means ATTACH_NB is not useful because vfu_attach_ctx() is actually
> > blocking.
> 
> You're correct that vfu_attach_ctx is in fact partially blocking: after
> accepting the connection, we call negotiate(), which can indeed block waiting if
> the client hasn't sent anything.
> 
> > I think this means vfu_run_ctx() is also blocking in some places
> 
> Correct. There's a presumption that if a message is ready, we can read it all
> without blocking, and equally that we can write to the socket without blocking.
> 
> The library docs are not at all clear on this point.
> 
> > and QEMU's event loop might hang :(
> > 
> > Can you make libvfio-user non-blocking in order to solve these issues?
> 
> I presume you're concerned about the security aspect: a malicious client could
> withhold a write, and hence hang the device server.
> 
> Problem is the libvfio-user API is synchronous: there's no way to return
> half-way through a vfu_attach_ctx() (or a vfu_run_ctx() after we read the
> header) then resume.
> 
> We'd have to have a whole separate API to do that, so a separate thread seems a
> better approach?

Whether to support non-blocking properly in libvfio-user is a decision
for you. If libvfio-user doesn't support non-blocking, then QEMU should
run a dedicated thread instead of the partially non-blocking approach in
this patch.

A non-blocking approach is nice when there are many devices hosted in a
single process or a lot of async replies (which requires extra thread
synchronization with the blocking approach).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  2021-09-08 15:02         ` Stefan Hajnoczi
@ 2021-09-08 15:21           ` John Levon
  2021-09-08 15:46             ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Levon @ 2021-09-08 15:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman,
	Swapnil Ingle, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, Thanos Makatos, alex.bennee

On Wed, Sep 08, 2021 at 04:02:22PM +0100, Stefan Hajnoczi wrote:

> > We'd have to have a whole separate API to do that, so a separate thread seems a
> > better approach?
> 
> Whether to support non-blocking properly in libvfio-user is a decision
> for you. If libvfio-user doesn't support non-blocking, then QEMU should
> run a dedicated thread instead of the partially non-blocking approach in
> this patch.

Right, sure. At this point we don't have any plans to implement a separate async
API due to the amount of work involved. 

> A non-blocking approach is nice when there are many devices hosted in a
> single process or a lot of async replies (which requires extra thread
> synchronization with the blocking approach).

I suppose this would be more of a problem with devices where the I/O path has to
be handled via the socket.

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context
  2021-09-08 15:21           ` John Levon
@ 2021-09-08 15:46             ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-08 15:46 UTC (permalink / raw)
  To: John Levon
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman,
	Swapnil Ingle, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, Thanos Makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1190 bytes --]

On Wed, Sep 08, 2021 at 03:21:19PM +0000, John Levon wrote:
> On Wed, Sep 08, 2021 at 04:02:22PM +0100, Stefan Hajnoczi wrote:
> 
> > > We'd have to have a whole separate API to do that, so a separate thread seems a
> > > better approach?
> > 
> > Whether to support non-blocking properly in libvfio-user is a decision
> > for you. If libvfio-user doesn't support non-blocking, then QEMU should
> > run a dedicated thread instead of the partially non-blocking approach in
> > this patch.
> 
> Right, sure. At this point we don't have any plans to implement a separate async
> API due to the amount of work involved. 
> 
> > A non-blocking approach is nice when there are many devices hosted in a
> > single process or a lot of async replies (which requires extra thread
> > synchronization with the blocking approach).
> 
> I suppose this would be more of a problem with devices where the I/O path has to
> be handled via the socket.

Yes, exactly. I think it shouldn't be a problem when shared memory is
used and the irqfd (eventfd) mechanism is used for IRQs. It becomes slow
when there's no shared memory or if raising IRQs requires protocol
messages.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-07 13:21       ` Stefan Hajnoczi
@ 2021-09-09  5:11         ` John Johnson
  2021-09-09  6:29           ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-09  5:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Sep 7, 2021, at 6:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> 
> This way the network communication code doesn't need to know how
> messages will by processed by the client or server. There is no need for
> if (isreply) { qemu_cond_signal(&reply->cv); } else {
> proxy->request(proxy->reqarg, buf, &reqfds); }. The callbacks and
> threads aren't hardcoded into the network communication code.
> 

	I fear we are talking past each other.  The vfio-user protocol
is bi-directional.  e.g., the client both sends requests to the server
and receives requests from the server on the same socket.  No matter
what threading model we use, the receive algorithm will be:


read message header
if it’s a reply
   schedule the thread waiting for the reply
else
   run a callback to process the request


	The only way I can see changing this is to establish two
uni-directional sockets: one for requests outbound to the server,
and one for requests inbound from the server.

	This is the reason I chose the iothread model.  It can run
independently of any vCPU/main threads waiting for replies and of
the callback thread.  I did muddle this idea by having the iothread
become a callback thread by grabbing BQL and running the callback
inline when it receives a request from the server, but if you like a
pure event driven model, I can make incoming requests kick a BH from
the main loop.  e.g.,

if it’s a reply
   qemu_cond_signal(reply cv)
else
   qemu_bh_schedule(proxy bh)

	That would avoid disconnect having to handle the iothread
blocked on BQL.


> This goes back to the question earlier about why a dedicated thread is
> necessary here. I suggest writing the network communication code using
> coroutines. That way the code is easier to read (no callbacks or
> thread synchronization), there are fewer thread-safety issues to worry
> about, and users or management tools don't need to know about additional
> threads (e.g. CPU/NUMA affinity).
> 


	I did look at coroutines, but they seemed to work when the sender
is triggering the coroutine on send, not when request packets are arriving
asynchronously to the sends.

								JJ


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 08/16] vfio-user: get region info
  2021-09-07 14:31   ` Stefan Hajnoczi
@ 2021-09-09  5:35     ` John Johnson
  2021-09-09  5:59       ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-09  5:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Sep 7, 2021, at 7:31 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:41AM -0700, Elena Ufimtseva wrote:
>> @@ -1514,6 +1515,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>>     return true;
>> }
>> 
>> +static int vfio_get_region_info_remfd(VFIODevice *vbasedev, int index)
>> +{
>> +    struct vfio_region_info *info;
>> +
>> +    if (vbasedev->regions == NULL || vbasedev->regions[index] == NULL) {
>> +        vfio_get_region_info(vbasedev, index, &info);
>> +    }
> 
> Maybe this will be called from other places in the future, but the
> vfio_region_setup() caller below already invoked vfio_get_region_info()
> so I'm not sure it's necessary to do this again?
> 
> Perhaps add an int *remfd argument to vfio_get_region_info(). That way
> vfio_get_region_info_remfd() isn't necessary.
> 

	I think they could be combined, but the region capabilities are
retrieved with separate calls to vfio_get_region_info_cap(), so I followed
that precedent.


>> @@ -2410,6 +2442,24 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>                          struct vfio_region_info **info)
>> {
>>     size_t argsz = sizeof(struct vfio_region_info);
>> +    int fd = -1;
>> +    int ret;
>> +
>> +    /* create region cache */
>> +    if (vbasedev->regions == NULL) {
>> +        vbasedev->regions = g_new0(struct vfio_region_info *,
>> +                                   vbasedev->num_regions);
>> +        if (vbasedev->proxy != NULL) {
>> +            vbasedev->regfds = g_new0(int, vbasedev->num_regions);
>> +        }
>> +    }
>> +    /* check cache */
>> +    if (vbasedev->regions[index] != NULL) {
>> +        *info = g_malloc0(vbasedev->regions[index]->argsz);
>> +        memcpy(*info, vbasedev->regions[index],
>> +               vbasedev->regions[index]->argsz);
>> +        return 0;
>> +    }
> 
> Why is it necessary to introduce a cache? Is it to avoid passing the
> same fd multiple times?
> 

	Yes.  For polled regions, the server returns an FD with each
message, so there would be an FD leak if I didn’t check that the region
hadn’t already returned one.  Since I had to cache the FD anyway, I
cached the whole struct.


>> 
>>     *info = g_malloc0(argsz);
>> 
>> @@ -2417,7 +2467,17 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>> retry:
>>     (*info)->argsz = argsz;
>> 
>> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
>> +    if (vbasedev->proxy != NULL) {
>> +        VFIOUserFDs fds = { 0, 1, &fd};
>> +
>> +        ret = vfio_user_get_region_info(vbasedev, index, *info, &fds);
>> +    } else {
>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info);
>> +        if (ret < 0) {
>> +            ret = -errno;
>> +        }
>> +    }
>> +    if (ret != 0) {
>>         g_free(*info);
>>         *info = NULL;
>>         return -errno;
>> @@ -2426,10 +2486,22 @@ retry:
>>     if ((*info)->argsz > argsz) {
>>         argsz = (*info)->argsz;
>>         *info = g_realloc(*info, argsz);
>> +        if (fd != -1) {
>> +            close(fd);
>> +            fd = -1;
>> +        }
>> 
>>         goto retry;
>>     }
>> 
>> +    /* fill cache */
>> +    vbasedev->regions[index] = g_malloc0(argsz);
>> +    memcpy(vbasedev->regions[index], *info, argsz);
>> +    *vbasedev->regions[index] = **info;
> 
> The previous line already copied the contents of *info. What is the
> purpose of this assignment?
> 

	That might be a mis-merge.  The struct assignment was a bug
fixed several revs ago with the memcpy() call.


>> +    if (vbasedev->regfds != NULL) {
>> +        vbasedev->regfds[index] = fd;
>> +    }
>> +
>>     return 0;
>> }
>> 
>> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
>> index b584b8e0f2..91b51f37df 100644
>> --- a/hw/vfio/user.c
>> +++ b/hw/vfio/user.c
>> @@ -734,3 +734,36 @@ int vfio_user_get_info(VFIODevice *vbasedev)
>>     vbasedev->reset_works = !!(msg.flags & VFIO_DEVICE_FLAGS_RESET);
>>     return 0;
>> }
>> +
>> +int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
>> +                              struct vfio_region_info *info, VFIOUserFDs *fds)
>> +{
>> +    g_autofree VFIOUserRegionInfo *msgp = NULL;
>> +    int size;
> 
> Please use uint32_t size instead of int size to prevent possible
> signedness issues:
> - VFIOUserRegionInfo->argsz is uint32_t.
> - sizeof(VFIOUserHdr) is size_t.
> - The vfio_user_request_msg() size argument is uint32_t.

	OK

		JJ



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 11/16] vfio-user: get and set IRQs
  2021-09-07 15:14   ` Stefan Hajnoczi
@ 2021-09-09  5:50     ` John Johnson
  2021-09-09 13:50       ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-09  5:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Sep 7, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:44AM -0700, Elena Ufimtseva wrote:
>> From: John Johnson <john.g.johnson@oracle.com>
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/vfio/user-protocol.h |  25 ++++++++++
>> hw/vfio/user.h          |   2 +
>> hw/vfio/common.c        |  26 ++++++++--
>> hw/vfio/pci.c           |  31 ++++++++++--
>> hw/vfio/user.c          | 106 ++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 181 insertions(+), 9 deletions(-)
>> 
>> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
>> index 56904cf872..5614efa0a4 100644
>> --- a/hw/vfio/user-protocol.h
>> +++ b/hw/vfio/user-protocol.h
>> @@ -109,6 +109,31 @@ typedef struct {
>>     uint64_t offset;
>> } VFIOUserRegionInfo;
>> 
>> +/*
>> + * VFIO_USER_DEVICE_GET_IRQ_INFO
>> + * imported from struct vfio_irq_info
>> + */
>> +typedef struct {
>> +    VFIOUserHdr hdr;
>> +    uint32_t argsz;
>> +    uint32_t flags;
>> +    uint32_t index;
>> +    uint32_t count;
>> +} VFIOUserIRQInfo;
>> +
>> +/*
>> + * VFIO_USER_DEVICE_SET_IRQS
>> + * imported from struct vfio_irq_set
>> + */
>> +typedef struct {
>> +    VFIOUserHdr hdr;
>> +    uint32_t argsz;
>> +    uint32_t flags;
>> +    uint32_t index;
>> +    uint32_t start;
>> +    uint32_t count;
>> +} VFIOUserIRQSet;
>> +
>> /*
>>  * VFIO_USER_REGION_READ
>>  * VFIO_USER_REGION_WRITE
>> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
>> index 02f832a173..248ad80943 100644
>> --- a/hw/vfio/user.h
>> +++ b/hw/vfio/user.h
>> @@ -74,6 +74,8 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
>> int vfio_user_get_info(VFIODevice *vbasedev);
>> int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
>>                               struct vfio_region_info *info, VFIOUserFDs *fds);
>> +int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info);
>> +int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
>> int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
>>                           uint32_t count, void *data);
>> int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index a8b1ea9358..9fe3e05dc6 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -71,7 +71,11 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
>>         .count = 0,
>>     };
>> 
>> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
>> +    if (vbasedev->proxy != NULL) {
>> +        vfio_user_set_irqs(vbasedev, &irq_set);
>> +    } else {
>> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
>> +    }
>> }
>> 
>> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
>> @@ -84,7 +88,11 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
>>         .count = 1,
>>     };
>> 
>> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
>> +    if (vbasedev->proxy != NULL) {
>> +        vfio_user_set_irqs(vbasedev, &irq_set);
>> +    } else {
>> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
>> +    }
>> }
>> 
>> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
>> @@ -97,7 +105,11 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
>>         .count = 1,
>>     };
>> 
>> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
>> +    if (vbasedev->proxy != NULL) {
>> +        vfio_user_set_irqs(vbasedev, &irq_set);
>> +    } else {
>> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
>> +    }
>> }
>> 
>> static inline const char *action_to_str(int action)
>> @@ -178,8 +190,12 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
>>     pfd = (int32_t *)&irq_set->data;
>>     *pfd = fd;
>> 
>> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
>> -        ret = -errno;
>> +    if (vbasedev->proxy != NULL) {
>> +        ret = vfio_user_set_irqs(vbasedev, irq_set);
>> +    } else {
>> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
>> +            ret = -errno;
>> +        }
>>     }
>>     g_free(irq_set);
>> 
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index ea0df8be65..282de6a30b 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -403,7 +403,11 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>>         fds[i] = fd;
>>     }
>> 
>> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    if (vdev->vbasedev.proxy != NULL) {
>> +        ret = vfio_user_set_irqs(&vdev->vbasedev, irq_set);
>> +    } else {
>> +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    }
>> 
>>     g_free(irq_set);
>> 
>> @@ -2675,7 +2679,13 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
>> 
>>     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
>> 
>> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
>> +    if (vbasedev->proxy != NULL) {
>> +        ret = vfio_user_get_irq_info(vbasedev, &irq_info);
>> +    } else {
>> +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
>> +    }
>> +
>> +
>>     if (ret) {
>>         /* This can fail for an old kernel or legacy PCI dev */
>>         trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
>> @@ -2794,8 +2804,16 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>>         return;
>>     }
>> 
>> -    if (ioctl(vdev->vbasedev.fd,
>> -              VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 || irq_info.count < 1) {
>> +    if (vdev->vbasedev.proxy != NULL) {
>> +        if (vfio_user_get_irq_info(&vdev->vbasedev, &irq_info) < 0) {
>> +            return;
>> +        }
>> +    } else {
>> +        if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0) {
>> +            return;
>> +        }
>> +    }
>> +    if (irq_info.count < 1) {
>>         return;
>>     }
>> 
>> @@ -3557,6 +3575,11 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
>>         }
>>     }
>> 
>> +    vfio_register_err_notifier(vdev);
>> +    vfio_register_req_notifier(vdev);
>> +
>> +    return;
>> +
>> out_deregister:
>>     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>>     kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
>> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
>> index 83235b2411..b68ca1279d 100644
>> --- a/hw/vfio/user.c
>> +++ b/hw/vfio/user.c
>> @@ -768,6 +768,112 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
>>     return 0;
>> }
>> 
>> +int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info)
>> +{
>> +    VFIOUserIRQInfo msg;
>> +
>> +    memset(&msg, 0, sizeof(msg));
>> +    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_IRQ_INFO,
>> +                          sizeof(msg), 0);
>> +    msg.argsz = info->argsz;
>> +    msg.index = info->index;
>> +
>> +    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
>> +    if (msg.hdr.flags & VFIO_USER_ERROR) {
>> +        return -msg.hdr.error_reply;
>> +    }
>> +
>> +    memcpy(info, &msg.argsz, sizeof(*info));
> 
> Should this be info.count = msg.count instead? Not sure why argsz is
> used here.

	It’s copying the entire returned vfio_irq_info struct, which starts
at &msg.argsz.


> 
> Also, I just noticed the lack of endianness conversion in this patch
> series. The spec says values are little-endian but these patches mostly
> use host-endian. Did I miss something?


	I had thought we were using host endianness for UNIX sockets and
were deferring the cross endianness issue to when we support TCP, but the
spec does say all LE.

								JJ



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 08/16] vfio-user: get region info
  2021-09-09  5:35     ` John Johnson
@ 2021-09-09  5:59       ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  5:59 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 2689 bytes --]

On Thu, Sep 09, 2021 at 05:35:40AM +0000, John Johnson wrote:
> 
> 
> > On Sep 7, 2021, at 7:31 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Mon, Aug 16, 2021 at 09:42:41AM -0700, Elena Ufimtseva wrote:
> >> @@ -1514,6 +1515,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
> >>     return true;
> >> }
> >> 
> >> +static int vfio_get_region_info_remfd(VFIODevice *vbasedev, int index)
> >> +{
> >> +    struct vfio_region_info *info;
> >> +
> >> +    if (vbasedev->regions == NULL || vbasedev->regions[index] == NULL) {
> >> +        vfio_get_region_info(vbasedev, index, &info);
> >> +    }
> > 
> > Maybe this will be called from other places in the future, but the
> > vfio_region_setup() caller below already invoked vfio_get_region_info()
> > so I'm not sure it's necessary to do this again?
> > 
> > Perhaps add an int *remfd argument to vfio_get_region_info(). That way
> > vfio_get_region_info_remfd() isn't necessary.
> > 
> 
> 	I think they could be combined, but the region capabilities are
> retrieved with separate calls to vfio_get_region_info_cap(), so I followed
> that precedent.
> 
> 
> >> @@ -2410,6 +2442,24 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>                          struct vfio_region_info **info)
> >> {
> >>     size_t argsz = sizeof(struct vfio_region_info);
> >> +    int fd = -1;
> >> +    int ret;
> >> +
> >> +    /* create region cache */
> >> +    if (vbasedev->regions == NULL) {
> >> +        vbasedev->regions = g_new0(struct vfio_region_info *,
> >> +                                   vbasedev->num_regions);
> >> +        if (vbasedev->proxy != NULL) {
> >> +            vbasedev->regfds = g_new0(int, vbasedev->num_regions);
> >> +        }
> >> +    }
> >> +    /* check cache */
> >> +    if (vbasedev->regions[index] != NULL) {
> >> +        *info = g_malloc0(vbasedev->regions[index]->argsz);
> >> +        memcpy(*info, vbasedev->regions[index],
> >> +               vbasedev->regions[index]->argsz);
> >> +        return 0;
> >> +    }
> > 
> > Why is it necessary to introduce a cache? Is it to avoid passing the
> > same fd multiple times?
> > 
> 
> 	Yes.  For polled regions, the server returns an FD with each
> message, so there would be an FD leak if I didn’t check that the region
> hadn’t already returned one.  Since I had to cache the FD anyway, I
> cached the whole struct.

If vfio_get_region_info() takes an int *fd argument then fd ownership
becomes explicit and the need for the cache falls away. Maybe Alex has a
preference for how to structure the code to track per-region fds.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-09-07 17:24   ` John Levon
@ 2021-09-09  6:00     ` John Johnson
  2021-09-09 12:05       ` John Levon
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-09  6:00 UTC (permalink / raw)
  To: John Levon
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos



> On Sep 7, 2021, at 10:24 AM, John Levon <john.levon@nutanix.com> wrote:
> 
> On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
> 
>> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
>> +                           uint64_t offset, uint32_t count, void *data)
>> +{
>> +    g_autofree VFIOUserRegionRW *msgp = NULL;
>> +    int size = sizeof(*msgp) + count;
>> +
>> +    msgp = g_malloc0(size);
>> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
>> +                          VFIO_USER_NO_REPLY);
> 
> Mirroring https://github.com/oracle/qemu/issues/10 here for visibility:
> 
> Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
> meaning essentially all writes are posted. But that shouldn't be the case, for
> example for PCI config space, where it's expected that writes will wait for an
> ack before the VCPU continues.
> 

	I’m not sure following the PCI spec (mem writes posted, config & IO
are not) completely solves the issue if the device uses sparse mmap.  A store
to went over the socket can be passed by a load that goes directly to memory,
which could break a driver that assumes a load completion means older stores
to the same device have also completed.

								JJ



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-09  5:11         ` John Johnson
@ 2021-09-09  6:29           ` Stefan Hajnoczi
  2021-09-10  5:25             ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  6:29 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> 
> 
> > On Sep 7, 2021, at 6:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > 
> > This way the network communication code doesn't need to know how
> > messages will by processed by the client or server. There is no need for
> > if (isreply) { qemu_cond_signal(&reply->cv); } else {
> > proxy->request(proxy->reqarg, buf, &reqfds); }. The callbacks and
> > threads aren't hardcoded into the network communication code.
> > 
> 
> 	I fear we are talking past each other.  The vfio-user protocol
> is bi-directional.  e.g., the client both sends requests to the server
> and receives requests from the server on the same socket.  No matter
> what threading model we use, the receive algorithm will be:
> 
> 
> read message header
> if it’s a reply
>    schedule the thread waiting for the reply
> else
>    run a callback to process the request
> 
> 
> 	The only way I can see changing this is to establish two
> uni-directional sockets: one for requests outbound to the server,
> and one for requests inbound from the server.
> 
> 	This is the reason I chose the iothread model.  It can run
> independently of any vCPU/main threads waiting for replies and of
> the callback thread.  I did muddle this idea by having the iothread
> become a callback thread by grabbing BQL and running the callback
> inline when it receives a request from the server, but if you like a
> pure event driven model, I can make incoming requests kick a BH from
> the main loop.  e.g.,
> 
> if it’s a reply
>    qemu_cond_signal(reply cv)
> else
>    qemu_bh_schedule(proxy bh)
> 
> 	That would avoid disconnect having to handle the iothread
> blocked on BQL.
> 
> 
> > This goes back to the question earlier about why a dedicated thread is
> > necessary here. I suggest writing the network communication code using
> > coroutines. That way the code is easier to read (no callbacks or
> > thread synchronization), there are fewer thread-safety issues to worry
> > about, and users or management tools don't need to know about additional
> > threads (e.g. CPU/NUMA affinity).
> > 
> 
> 
> 	I did look at coroutines, but they seemed to work when the sender
> is triggering the coroutine on send, not when request packets are arriving
> asynchronously to the sends.

This can be done with a receiver coroutine. Its job is to be the only
thing that reads vfio-user messages from the socket. A receiver
coroutine reads messages from the socket and wakes up the waiting
coroutine that yielded from vfio_user_send_recv() or
vfio_user_pci_process_req().

(Although vfio_user_pci_process_req() could be called directly from the
receiver coroutine, it seems safer to have a separate coroutine that
processes requests so that the receiver isn't blocked in case
vfio_user_pci_process_req() yields while processing a request.)

Going back to what you mentioned above, the receiver coroutine does
something like this:

  if it's a reply
      reply = find_reply(...)
      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
  else
      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
      if (pending_reqs_was_empty) {
          qemu_coroutine_enter(process_request_co);
      }

The pending_reqs queue holds incoming requests that the
process_request_co coroutine processes.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses
  2021-08-27 17:53   ` [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2021-09-09  7:27     ` Stefan Hajnoczi
  2021-09-10 16:22       ` Jag Raman
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  7:27 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]

On Fri, Aug 27, 2021 at 01:53:25PM -0400, Jagannathan Raman wrote:
> +static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> +                                     size_t count, loff_t offset,
> +                                     const bool is_write)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    uint32_t pci_access_width = sizeof(uint32_t);
> +    size_t bytes = count;
> +    uint32_t val = 0;
> +    char *ptr = buf;
> +    int len;
> +
> +    while (bytes > 0) {
> +        len = (bytes > pci_access_width) ? pci_access_width : bytes;
> +        if (is_write) {
> +            memcpy(&val, ptr, len);
> +            pci_default_write_config(PCI_DEVICE(o->pci_dev),
> +                                     offset, val, len);
> +            trace_vfu_cfg_write(offset, val);
> +        } else {
> +            val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
> +                                          offset, len);
> +            memcpy(ptr, &val, len);

pci_default_read_config() returns a host-endian 32-bit value. This code
looks wrong because it copies different bytes on big- and little-endian
hosts.

> +            trace_vfu_cfg_read(offset, val);
> +        }

Why call pci_default_read/write_config() directly instead of
pci_dev->config_read/write()?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 07/11] vfio-user: handle DMA mappings
  2021-08-27 17:53   ` [PATCH RFC server v2 07/11] vfio-user: handle DMA mappings Jagannathan Raman
@ 2021-09-09  7:29     ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  7:29 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 557 bytes --]

On Fri, Aug 27, 2021 at 01:53:26PM -0400, Jagannathan Raman wrote:
> Define and register callbacks to manage the RAM regions used for
> device DMA
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  2 ++
>  2 files changed, 52 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses
  2021-08-27 17:53   ` [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
@ 2021-09-09  7:37     ` Stefan Hajnoczi
  2021-09-10 16:36       ` Jag Raman
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  7:37 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1620 bytes --]

On Fri, Aug 27, 2021 at 01:53:27PM -0400, Jagannathan Raman wrote:
> +/**
> + * VFU_OBJECT_BAR_HANDLER - macro for defining handlers for PCI BARs.
> + *
> + * To create handler for BAR number 2, VFU_OBJECT_BAR_HANDLER(2) would
> + * define vfu_object_bar2_handler
> + */
> +#define VFU_OBJECT_BAR_HANDLER(BAR_NO)                                         \
> +    static ssize_t vfu_object_bar##BAR_NO##_handler(vfu_ctx_t *vfu_ctx,        \
> +                                        char * const buf, size_t count,        \
> +                                        loff_t offset, const bool is_write)    \
> +    {                                                                          \
> +        VfuObject *o = vfu_get_private(vfu_ctx);                               \
> +        hwaddr addr = (hwaddr)(pci_get_long(o->pci_dev->config +               \
> +                                            PCI_BASE_ADDRESS_0 +               \
> +                                            (4 * BAR_NO)) + offset);           \

Does this handle 64-bit BARs?

> +/**
> + * vfu_object_register_bars - Identify active BAR regions of pdev and setup
> + *                            callbacks to handle read/write accesses
> + */
> +static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
> +{
> +    uint32_t orig_val, new_val;
> +    int i, size;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        orig_val = pci_default_read_config(pdev,
> +                                           PCI_BASE_ADDRESS_0 + (4 * i), 4);

Same question as in an earlier patch: should we call pdev->read_config()?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 09/11] vfio-user: handle device interrupts
  2021-08-27 17:53   ` [PATCH RFC server v2 09/11] vfio-user: handle device interrupts Jagannathan Raman
@ 2021-09-09  7:40     ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  7:40 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 587 bytes --]

On Fri, Aug 27, 2021 at 01:53:28PM -0400, Jagannathan Raman wrote:
> Forward remote device's interrupts to the guest
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/remote/iohub.h |  2 ++
>  hw/remote/iohub.c         |  5 +++++
>  hw/remote/vfio-user-obj.c | 30 ++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  1 +
>  4 files changed, 38 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 10/11] vfio-user: register handlers to facilitate migration
  2021-08-27 17:53   ` [PATCH RFC server v2 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2021-09-09  8:14     ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  8:14 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 9492 bytes --]

On Fri, Aug 27, 2021 at 01:53:29PM -0400, Jagannathan Raman wrote:
> +static ssize_t vfu_mig_buf_read(void *opaque, uint8_t *buf, int64_t pos,
> +                                size_t size, Error **errp)
> +{
> +    VfuObject *o = opaque;
> +
> +    if (pos > o->vfu_mig_buf_size) {
> +        size = 0;
> +    } else if ((pos + size) > o->vfu_mig_buf_size) {
> +        size = o->vfu_mig_buf_size;
> +    }
> +
> +    memcpy(buf, (o->vfu_mig_buf + pos), size);
> +
> +    o->vfu_mig_buf_size -= size;

This looks strange. pos increases each time we're called. We seem to be
truncating the buffer on each read. Should this line be dropped? Did you
test live migration (maybe this code needs more debugging)?

> +
> +    return size;
> +}
> +
> +static ssize_t vfu_mig_buf_write(void *opaque, struct iovec *iov, int iovcnt,
> +                                 int64_t pos, Error **errp)
> +{
> +    VfuObject *o = opaque;
> +    uint64_t end = pos + iov_size(iov, iovcnt);
> +    int i;
> +
> +    if (end > o->vfu_mig_buf_size) {
> +        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
> +    }
> +
> +    for (i = 0; i < iovcnt; i++) {
> +        memcpy((o->vfu_mig_buf + o->vfu_mig_buf_size), iov[i].iov_base,
> +               iov[i].iov_len);
> +        o->vfu_mig_buf_size += iov[i].iov_len;
> +        o->vfu_mig_buf_pending += iov[i].iov_len;
> +    }
> +
> +    return iov_size(iov, iovcnt);
> +}
> +
> +static int vfu_mig_buf_shutdown(void *opaque, bool rd, bool wr, Error **errp)
> +{
> +    VfuObject *o = opaque;
> +
> +    o->vfu_mig_buf_size = 0;
> +
> +    g_free(o->vfu_mig_buf);
> +
> +    return 0;
> +}
> +
> +static const QEMUFileOps vfu_mig_fops_save = {
> +    .writev_buffer  = vfu_mig_buf_write,
> +    .shut_down      = vfu_mig_buf_shutdown,
> +};
> +
> +static const QEMUFileOps vfu_mig_fops_load = {
> +    .get_buffer     = vfu_mig_buf_read,
> +    .shut_down      = vfu_mig_buf_shutdown,
> +};
> +
> +/**
> + * handlers for vfu_migration_callbacks_t
> + *
> + * The libvfio-user library accesses these handlers to drive the migration
> + * at the remote end, and also to transport the data stored in vfu_mig_buf
> + *
> + */
> +static void vfu_mig_state_precopy(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    int ret;
> +
> +    if (!o->vfu_mig_file) {
> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_save, false);
> +    }
> +
> +    ret = qemu_remote_savevm(o->vfu_mig_file, DEVICE(o->pci_dev));
> +    if (ret) {
> +        qemu_file_shutdown(o->vfu_mig_file);
> +        return;
> +    }
> +
> +    qemu_fflush(o->vfu_mig_file);
> +}

Are you sure pre-copy is the state where you want to serialize the
savevm data? IIUC pre-copy is the iterative state while the device is
still running (e.g. when copying RAM but before devices are stopped). I
expected savevm to happen when we reach stop-and-copy.

The reason why this matters is that we're saving the state of the device
while the guest is still running and possibly interacting with the
device. The destination won't have the final state of the device, it
will have an earlier state of the device when we started migrating RAM!

Maybe I'm wrong, please double-check, but this looks like a bug.

> +
> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> +    static int migrated_devs;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> +    if (ret) {
> +        error_setg(&error_abort, "vfu: failed to restore device state");
> +        return;
> +    }
> +
> +    if (++migrated_devs == k->nr_devs) {
> +        bdrv_invalidate_cache_all(&local_err);
> +        if (local_err) {
> +            error_report_err(local_err);
> +            return;
> +        }
> +
> +        vm_start();
> +    }
> +}

This looks like it's intended for the destination side. Does this code
work on the source side if the device is transitioned back into the
running state?

> +
> +static void vfu_mig_state_stop(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> +    static int migrated_devs;
> +
> +    /**
> +     * note: calling bdrv_inactivate_all() is not the best approach.
> +     *
> +     *  Ideally, we would identify the block devices (if any) indirectly
> +     *  linked (such as via a scs-hd device) to each of the migrated devices,
> +     *  and inactivate them individually. This is essential while operating
> +     *  the server in a storage daemon mode, with devices from different VMs.
> +     *
> +     *  However, we currently don't have this capability. As such, we need to
> +     *  inactivate all devices at the same time when migration is completed.
> +     */
> +    if (++migrated_devs == k->nr_devs) {
> +        bdrv_inactivate_all();
> +    }
> +}
> +
> +static int vfu_mig_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
> +{
> +    switch (state) {
> +    case VFU_MIGR_STATE_RESUME:
> +    case VFU_MIGR_STATE_STOP_AND_COPY:
> +        break;
> +    case VFU_MIGR_STATE_STOP:
> +        vfu_mig_state_stop(vfu_ctx);
> +        break;
> +    case VFU_MIGR_STATE_PRE_COPY:
> +        vfu_mig_state_precopy(vfu_ctx);
> +        break;
> +    case VFU_MIGR_STATE_RUNNING:
> +        if (!runstate_is_running()) {
> +            vfu_mig_state_running(vfu_ctx);
> +        }
> +        break;
> +    default:
> +        warn_report("vfu: Unknown migration state %d", state);
> +    }
> +
> +    return 0;
> +}
> +
> +static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    return o->vfu_mig_buf_pending;
> +}
> +
> +static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
> +                                uint64_t *size)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    if (offset) {
> +        *offset = 0;
> +    }
> +
> +    if (size) {
> +        *size = o->vfu_mig_buf_size;
> +    }
> +
> +    return 0;
> +}
> +
> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> +                                 uint64_t size, uint64_t offset)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    if (offset > o->vfu_mig_buf_size) {
> +        return -1;
> +    }
> +
> +    if ((offset + size) > o->vfu_mig_buf_size) {
> +        warn_report("vfu: buffer overflow - check pending_bytes");
> +        size = o->vfu_mig_buf_size - offset;
> +    }
> +
> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> +
> +    o->vfu_mig_buf_pending -= size;
> +
> +    return size;
> +}
> +
> +static ssize_t vfu_mig_write_data(vfu_ctx_t *vfu_ctx, void *data,
> +                                  uint64_t size, uint64_t offset)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    uint64_t end = offset + size;
> +
> +    if (end > o->vfu_mig_buf_size) {
> +        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
> +        o->vfu_mig_buf_size = end;
> +    }
> +
> +    memcpy((o->vfu_mig_buf + offset), data, size);
> +
> +    if (!o->vfu_mig_file) {
> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> +    }

Why open the file here where it's not accessed? I expected this to
happen at the point where data has been fully written and we call
qemu_remote_loadvm().

> +
> +    return size;
> +}
> +
> +static int vfu_mig_data_written(vfu_ctx_t *vfu_ctx, uint64_t count)
> +{
> +    return 0;
> +}
> +
> +static const vfu_migration_callbacks_t vfu_mig_cbs = {
> +    .version = VFU_MIGR_CALLBACKS_VERS,
> +    .transition = &vfu_mig_transition,
> +    .get_pending_bytes = &vfu_mig_get_pending_bytes,
> +    .prepare_data = &vfu_mig_prepare_data,
> +    .read_data = &vfu_mig_read_data,
> +    .data_written = &vfu_mig_data_written,
> +    .write_data = &vfu_mig_write_data,
> +};
> +
>  static void vfu_object_ctx_run(void *opaque)
>  {
>      VfuObject *o = opaque;
> @@ -340,6 +615,7 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>  {
>      VfuObject *o = container_of(notifier, VfuObject, machine_done);
>      DeviceState *dev = NULL;
> +    size_t migr_area_size;
>      QemuThread thread;
>      int ret;
>  
> @@ -401,6 +677,35 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>          return;
>      }
>  
> +    /*
> +     * TODO: The 0x20000 number used below is a temporary. We are working on
> +     *     a cleaner fix for this.
> +     *
> +     *     The libvfio-user library assumes that the remote knows the size of
> +     *     the data to be migrated at boot time, but that is not the case with
> +     *     VMSDs, as it can contain a variable-size buffer. 0x20000 is used
> +     *     as a sufficiently large buffer to demonstrate migration, but that
> +     *     cannot be used as a solution.
> +     *
> +     */

libvfio-user has the vfu_migration_callbacks_t interface that allows the
device to save/load more data regardless of the size of the migration
region. I don't see the issue here since the region doesn't need to be
sized to fit the savevm data?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 00/11] vfio-user server in QEMU
  2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
                     ` (11 preceding siblings ...)
  2021-09-08 10:08   ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Stefan Hajnoczi
@ 2021-09-09  8:17   ` Stefan Hajnoczi
  2021-09-10 14:02     ` Jag Raman
  12 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09  8:17 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, philmd, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 134 bytes --]

Hi Jag,
I have finished reviewing these patches and left comments. I didn't take
a look at the libvfio-user's implementation.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-09-09  6:00     ` John Johnson
@ 2021-09-09 12:05       ` John Levon
  2021-09-10  6:07         ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: John Levon @ 2021-09-09 12:05 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos

On Thu, Sep 09, 2021 at 06:00:36AM +0000, John Johnson wrote:

> > On Sep 7, 2021, at 10:24 AM, John Levon <john.levon@nutanix.com> wrote:
> > 
> > On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
> > 
> >> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> >> +                           uint64_t offset, uint32_t count, void *data)
> >> +{
> >> +    g_autofree VFIOUserRegionRW *msgp = NULL;
> >> +    int size = sizeof(*msgp) + count;
> >> +
> >> +    msgp = g_malloc0(size);
> >> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
> >> +                          VFIO_USER_NO_REPLY);
> > 
> > Mirroring https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_oracle_qemu_issues_10&d=DwIGaQ&c=s883GpUCOChKOHiocYtGcg&r=v7SNLJqx7b9Vfc7ZO82Wg4nnZ8O5XkACFQ30bVKxotI&m=PJ390CfKPdTFUffSi02whMSqey2en8OJv7dm9VAQKI0&s=Mfp1xRKET3LEucEeZwUVUvSJ3V0zzGEktOzFwMsTfEE&e=  here for visibility:
> > 
> > Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
> > meaning essentially all writes are posted. But that shouldn't be the case, for
> > example for PCI config space, where it's expected that writes will wait for an
> > ack before the VCPU continues.
> 
> 	I’m not sure following the PCI spec (mem writes posted, config & IO
> are not) completely solves the issue if the device uses sparse mmap.  A store
> to went over the socket can be passed by a load that goes directly to memory,
> which could break a driver that assumes a load completion means older stores
> to the same device have also completed.

Sure, but sparse mmaps are under the device's control - so wouldn't that be
something of a "don't do that" scenario?

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 11/16] vfio-user: get and set IRQs
  2021-09-09  5:50     ` John Johnson
@ 2021-09-09 13:50       ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-09 13:50 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 8266 bytes --]

On Thu, Sep 09, 2021 at 05:50:39AM +0000, John Johnson wrote:
> 
> 
> > On Sep 7, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Mon, Aug 16, 2021 at 09:42:44AM -0700, Elena Ufimtseva wrote:
> >> From: John Johnson <john.g.johnson@oracle.com>
> >> 
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> hw/vfio/user-protocol.h |  25 ++++++++++
> >> hw/vfio/user.h          |   2 +
> >> hw/vfio/common.c        |  26 ++++++++--
> >> hw/vfio/pci.c           |  31 ++++++++++--
> >> hw/vfio/user.c          | 106 ++++++++++++++++++++++++++++++++++++++++
> >> 5 files changed, 181 insertions(+), 9 deletions(-)
> >> 
> >> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> >> index 56904cf872..5614efa0a4 100644
> >> --- a/hw/vfio/user-protocol.h
> >> +++ b/hw/vfio/user-protocol.h
> >> @@ -109,6 +109,31 @@ typedef struct {
> >>     uint64_t offset;
> >> } VFIOUserRegionInfo;
> >> 
> >> +/*
> >> + * VFIO_USER_DEVICE_GET_IRQ_INFO
> >> + * imported from struct vfio_irq_info
> >> + */
> >> +typedef struct {
> >> +    VFIOUserHdr hdr;
> >> +    uint32_t argsz;
> >> +    uint32_t flags;
> >> +    uint32_t index;
> >> +    uint32_t count;
> >> +} VFIOUserIRQInfo;
> >> +
> >> +/*
> >> + * VFIO_USER_DEVICE_SET_IRQS
> >> + * imported from struct vfio_irq_set
> >> + */
> >> +typedef struct {
> >> +    VFIOUserHdr hdr;
> >> +    uint32_t argsz;
> >> +    uint32_t flags;
> >> +    uint32_t index;
> >> +    uint32_t start;
> >> +    uint32_t count;
> >> +} VFIOUserIRQSet;
> >> +
> >> /*
> >>  * VFIO_USER_REGION_READ
> >>  * VFIO_USER_REGION_WRITE
> >> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> >> index 02f832a173..248ad80943 100644
> >> --- a/hw/vfio/user.h
> >> +++ b/hw/vfio/user.h
> >> @@ -74,6 +74,8 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
> >> int vfio_user_get_info(VFIODevice *vbasedev);
> >> int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
> >>                               struct vfio_region_info *info, VFIOUserFDs *fds);
> >> +int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info);
> >> +int vfio_user_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irq);
> >> int vfio_user_region_read(VFIODevice *vbasedev, uint32_t index, uint64_t offset,
> >>                           uint32_t count, void *data);
> >> int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index a8b1ea9358..9fe3e05dc6 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -71,7 +71,11 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
> >>         .count = 0,
> >>     };
> >> 
> >> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> >> +    if (vbasedev->proxy != NULL) {
> >> +        vfio_user_set_irqs(vbasedev, &irq_set);
> >> +    } else {
> >> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> >> +    }
> >> }
> >> 
> >> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
> >> @@ -84,7 +88,11 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
> >>         .count = 1,
> >>     };
> >> 
> >> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> >> +    if (vbasedev->proxy != NULL) {
> >> +        vfio_user_set_irqs(vbasedev, &irq_set);
> >> +    } else {
> >> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> >> +    }
> >> }
> >> 
> >> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
> >> @@ -97,7 +105,11 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
> >>         .count = 1,
> >>     };
> >> 
> >> -    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> >> +    if (vbasedev->proxy != NULL) {
> >> +        vfio_user_set_irqs(vbasedev, &irq_set);
> >> +    } else {
> >> +        ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> >> +    }
> >> }
> >> 
> >> static inline const char *action_to_str(int action)
> >> @@ -178,8 +190,12 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
> >>     pfd = (int32_t *)&irq_set->data;
> >>     *pfd = fd;
> >> 
> >> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
> >> -        ret = -errno;
> >> +    if (vbasedev->proxy != NULL) {
> >> +        ret = vfio_user_set_irqs(vbasedev, irq_set);
> >> +    } else {
> >> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
> >> +            ret = -errno;
> >> +        }
> >>     }
> >>     g_free(irq_set);
> >> 
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index ea0df8be65..282de6a30b 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -403,7 +403,11 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
> >>         fds[i] = fd;
> >>     }
> >> 
> >> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +    if (vdev->vbasedev.proxy != NULL) {
> >> +        ret = vfio_user_set_irqs(&vdev->vbasedev, irq_set);
> >> +    } else {
> >> +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +    }
> >> 
> >>     g_free(irq_set);
> >> 
> >> @@ -2675,7 +2679,13 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
> >> 
> >>     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
> >> 
> >> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> >> +    if (vbasedev->proxy != NULL) {
> >> +        ret = vfio_user_get_irq_info(vbasedev, &irq_info);
> >> +    } else {
> >> +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> >> +    }
> >> +
> >> +
> >>     if (ret) {
> >>         /* This can fail for an old kernel or legacy PCI dev */
> >>         trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
> >> @@ -2794,8 +2804,16 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> >>         return;
> >>     }
> >> 
> >> -    if (ioctl(vdev->vbasedev.fd,
> >> -              VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 || irq_info.count < 1) {
> >> +    if (vdev->vbasedev.proxy != NULL) {
> >> +        if (vfio_user_get_irq_info(&vdev->vbasedev, &irq_info) < 0) {
> >> +            return;
> >> +        }
> >> +    } else {
> >> +        if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0) {
> >> +            return;
> >> +        }
> >> +    }
> >> +    if (irq_info.count < 1) {
> >>         return;
> >>     }
> >> 
> >> @@ -3557,6 +3575,11 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
> >>         }
> >>     }
> >> 
> >> +    vfio_register_err_notifier(vdev);
> >> +    vfio_register_req_notifier(vdev);
> >> +
> >> +    return;
> >> +
> >> out_deregister:
> >>     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> >>     kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
> >> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> >> index 83235b2411..b68ca1279d 100644
> >> --- a/hw/vfio/user.c
> >> +++ b/hw/vfio/user.c
> >> @@ -768,6 +768,112 @@ int vfio_user_get_region_info(VFIODevice *vbasedev, int index,
> >>     return 0;
> >> }
> >> 
> >> +int vfio_user_get_irq_info(VFIODevice *vbasedev, struct vfio_irq_info *info)
> >> +{
> >> +    VFIOUserIRQInfo msg;
> >> +
> >> +    memset(&msg, 0, sizeof(msg));
> >> +    vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_IRQ_INFO,
> >> +                          sizeof(msg), 0);
> >> +    msg.argsz = info->argsz;
> >> +    msg.index = info->index;
> >> +
> >> +    vfio_user_send_recv(vbasedev->proxy, &msg.hdr, NULL, 0, 0);
> >> +    if (msg.hdr.flags & VFIO_USER_ERROR) {
> >> +        return -msg.hdr.error_reply;
> >> +    }
> >> +
> >> +    memcpy(info, &msg.argsz, sizeof(*info));
> > 
> > Should this be info.count = msg.count instead? Not sure why argsz is
> > used here.
> 
> 	It’s copying the entire returned vfio_irq_info struct, which starts
> at &msg.argsz.

That makes sense, I missed it. Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-09  6:29           ` Stefan Hajnoczi
@ 2021-09-10  5:25             ` John Johnson
  2021-09-13 12:35               ` Stefan Hajnoczi
  2021-09-13 17:23               ` John Johnson
  0 siblings, 2 replies; 108+ messages in thread
From: John Johnson @ 2021-09-10  5:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
>> 
>> 
>> 	I did look at coroutines, but they seemed to work when the sender
>> is triggering the coroutine on send, not when request packets are arriving
>> asynchronously to the sends.
> 
> This can be done with a receiver coroutine. Its job is to be the only
> thing that reads vfio-user messages from the socket. A receiver
> coroutine reads messages from the socket and wakes up the waiting
> coroutine that yielded from vfio_user_send_recv() or
> vfio_user_pci_process_req().
> 
> (Although vfio_user_pci_process_req() could be called directly from the
> receiver coroutine, it seems safer to have a separate coroutine that
> processes requests so that the receiver isn't blocked in case
> vfio_user_pci_process_req() yields while processing a request.)
> 
> Going back to what you mentioned above, the receiver coroutine does
> something like this:
> 
>  if it's a reply
>      reply = find_reply(...)
>      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
>  else
>      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
>      if (pending_reqs_was_empty) {
>          qemu_coroutine_enter(process_request_co);
>      }
> 
> The pending_reqs queue holds incoming requests that the
> process_request_co coroutine processes.
> 


	How do coroutines work across threads?  There can be multiple vCPU
threads waiting for replies, and I think the receiver coroutine will be
running in the main loop thread.  Where would a vCPU block waiting for
a reply?  I think coroutine_yield() returns to its coroutine_enter() caller.

							JJ




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-09-09 12:05       ` John Levon
@ 2021-09-10  6:07         ` John Johnson
  2021-09-10 12:16           ` John Levon
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-10  6:07 UTC (permalink / raw)
  To: John Levon
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos



> On Sep 9, 2021, at 5:05 AM, John Levon <john.levon@nutanix.com> wrote:
> 
> On Thu, Sep 09, 2021 at 06:00:36AM +0000, John Johnson wrote:
> 
>>> On Sep 7, 2021, at 10:24 AM, John Levon <john.levon@nutanix.com> wrote:
>>> 
>>> On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
>>> 
>>>> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
>>>> +                           uint64_t offset, uint32_t count, void *data)
>>>> +{
>>>> +    g_autofree VFIOUserRegionRW *msgp = NULL;
>>>> +    int size = sizeof(*msgp) + count;
>>>> +
>>>> +    msgp = g_malloc0(size);
>>>> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
>>>> +                          VFIO_USER_NO_REPLY);
>>> 
>>> Mirroring https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_oracle_qemu_issues_10&d=DwIGaQ&c=s883GpUCOChKOHiocYtGcg&r=v7SNLJqx7b9Vfc7ZO82Wg4nnZ8O5XkACFQ30bVKxotI&m=PJ390CfKPdTFUffSi02whMSqey2en8OJv7dm9VAQKI0&s=Mfp1xRKET3LEucEeZwUVUvSJ3V0zzGEktOzFwMsTfEE&e=  here for visibility:
>>> 
>>> Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
>>> meaning essentially all writes are posted. But that shouldn't be the case, for
>>> example for PCI config space, where it's expected that writes will wait for an
>>> ack before the VCPU continues.
>> 
>> 	I’m not sure following the PCI spec (mem writes posted, config & IO
>> are not) completely solves the issue if the device uses sparse mmap.  A store
>> to went over the socket can be passed by a load that goes directly to memory,
>> which could break a driver that assumes a load completion means older stores
>> to the same device have also completed.
> 
> Sure, but sparse mmaps are under the device's control - so wouldn't that be
> something of a "don't do that" scenario?
> 

	The sparse mmaps are under the emulation program’s control, but it
doesn’t know what registers the guest device driver is using to force stores
to complete.  The Linux device drivers doc on kernel.org just says the driver
must read from the same device.

								JJ


https://www.kernel.org/doc/Documentation/driver-api/device-io.rst

While the basic functions are defined to be synchronous with respect to
each other and ordered with respect to each other the busses the devices
sit on may themselves have asynchronicity. In particular many authors
are burned by the fact that PCI bus writes are posted asynchronously. A
driver author must issue a read from the same device to ensure that
writes have occurred in the specific cases the author cares.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 09/16] vfio-user: region read/write
  2021-09-10  6:07         ` John Johnson
@ 2021-09-10 12:16           ` John Levon
  0 siblings, 0 replies; 108+ messages in thread
From: John Levon @ 2021-09-10 12:16 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, qemu-devel,
	alex.williamson, stefanha, Thanos Makatos

On Fri, Sep 10, 2021 at 06:07:56AM +0000, John Johnson wrote:

> >>> On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
> >>> 
> >>>> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> >>>> +                           uint64_t offset, uint32_t count, void *data)
> >>>> +{
> >>>> +    g_autofree VFIOUserRegionRW *msgp = NULL;
> >>>> +    int size = sizeof(*msgp) + count;
> >>>> +
> >>>> +    msgp = g_malloc0(size);
> >>>> +    vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
> >>>> +                          VFIO_USER_NO_REPLY);
> >>> 
> >>> Mirroring https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_oracle_qemu_issues_10&d=DwIGaQ&c=s883GpUCOChKOHiocYtGcg&r=v7SNLJqx7b9Vfc7ZO82Wg4nnZ8O5XkACFQ30bVKxotI&m=PJ390CfKPdTFUffSi02whMSqey2en8OJv7dm9VAQKI0&s=Mfp1xRKET3LEucEeZwUVUvSJ3V0zzGEktOzFwMsTfEE&e=  here for visibility:
> >>> 
> >>> Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
> >>> meaning essentially all writes are posted. But that shouldn't be the case, for
> >>> example for PCI config space, where it's expected that writes will wait for an
> >>> ack before the VCPU continues.
> >> 
> >> 	I’m not sure following the PCI spec (mem writes posted, config & IO
> >> are not) completely solves the issue if the device uses sparse mmap.  A store
> >> to went over the socket can be passed by a load that goes directly to memory,
> >> which could break a driver that assumes a load completion means older stores
> >> to the same device have also completed.
> > 
> > Sure, but sparse mmaps are under the device's control - so wouldn't that be
> > something of a "don't do that" scenario?
> 
> 	The sparse mmaps are under the emulation program’s control, but it
> doesn’t know what registers the guest device driver is using to force stores
> to complete.  The Linux device drivers doc on kernel.org just says the driver
> must read from the same device.

Sure, but any device where that is important wouldn't use the sparse mmaps, no?

There's no other alternative.

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 00/11] vfio-user server in QEMU
  2021-09-09  8:17   ` Stefan Hajnoczi
@ 2021-09-10 14:02     ` Jag Raman
  0 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-10 14:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 9, 2021, at 4:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> Hi Jag,
> I have finished reviewing these patches and left comments. I didn't take
> a look at the libvfio-user's implementation.

Thank you for you comments, Stefan - we’ll get cracking on them. :)

--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 02/11] vfio-user: define vfio-user object
  2021-09-08 12:37     ` Stefan Hajnoczi
@ 2021-09-10 14:04       ` Jag Raman
  0 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-10 14:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 8, 2021, at 8:37 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Aug 27, 2021 at 01:53:21PM -0400, Jagannathan Raman wrote:
>> Define vfio-user object which is remote process server for QEMU. Setup
>> object initialization functions and properties necessary to instantiate
>> the object
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> qapi/qom.json             |  20 ++++++-
>> hw/remote/vfio-user-obj.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++
>> MAINTAINERS               |   1 +
>> hw/remote/meson.build     |   1 +
>> hw/remote/trace-events    |   3 +
>> 5 files changed, 168 insertions(+), 2 deletions(-)
>> create mode 100644 hw/remote/vfio-user-obj.c
>> 
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index a25616b..3e941ee 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -689,6 +689,20 @@
>>   'data': { 'fd': 'str', 'devid': 'str' } }
>> 
>> ##
>> +# @VfioUserProperties:
>> +#
>> +# Properties for vfio-user objects.
>> +#
>> +# @socket: path to be used as socket by the libvfiouser library
>> +#
>> +# @devid: the id of the device to be associated with the file descriptor
>> +#
>> +# Since: 6.0
>> +##
>> +{ 'struct': 'VfioUserProperties',
>> +  'data': { 'socket': 'str', 'devid': 'str' } }
> 
> Please use 'SocketAddress' for socket instead of 'str'. That way file
> descriptor passing is easy to support and additional socket address
> families can be supported in the future.

OK, will do.

> 
>> +
>> +##
>> # @RngProperties:
>> #
>> # Properties for objects of classes derived from rng.
>> @@ -812,7 +826,8 @@
>>     'tls-creds-psk',
>>     'tls-creds-x509',
>>     'tls-cipher-suites',
>> -    'x-remote-object'
>> +    'x-remote-object',
>> +    'vfio-user'
>>   ] }
>> 
>> ##
>> @@ -868,7 +883,8 @@
>>       'tls-creds-psk':              'TlsCredsPskProperties',
>>       'tls-creds-x509':             'TlsCredsX509Properties',
>>       'tls-cipher-suites':          'TlsCredsProperties',
>> -      'x-remote-object':            'RemoteObjectProperties'
>> +      'x-remote-object':            'RemoteObjectProperties',
>> +      'vfio-user':                  'VfioUserProperties'
> 
> "vfio-user" doesn't communicate whether this is a client or server. Is
> "vfio-user-server" clearer?

“vfio-user-server” sounds better.

> 
>>   } }
>> 
>> ##
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> new file mode 100644
>> index 0000000..4a1e297
>> --- /dev/null
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -0,0 +1,145 @@
>> +/**
>> + * QEMU vfio-user server object
>> + *
>> + * Copyright © 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
>> + *
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +/**
>> + * Usage: add options:
>> + *     -machine x-remote
>> + *     -device <PCI-device>,id=<pci-dev-id>
>> + *     -object vfio-user,id=<id>,socket=<socket-path>,devid=<pci-dev-id>
> 
> I suggest renaming devid= to device= or pci-device= (similar to drive=
> and netdev=) for consistency and to avoid confusion with PCI Device IDs.

OK, will do.

> 
>> diff --git a/hw/remote/meson.build b/hw/remote/meson.build
>> index fb35fb8..cd44dfc 100644
>> --- a/hw/remote/meson.build
>> +++ b/hw/remote/meson.build
>> @@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
>> remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
>> remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
>> remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
>> +remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('vfio-user-obj.c'))
> 
> If you use CONFIG_VFIO_USER_SERVER then it's easier to separate mpqemu
> from vfio-user. Sharing CONFIG_MULTIPROCESS could become messy later.

OK, got it.

--
Jag


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context
  2021-09-08 12:40     ` Stefan Hajnoczi
@ 2021-09-10 14:58       ` Jag Raman
  0 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-10 14:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 8, 2021, at 8:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Aug 27, 2021 at 01:53:22PM -0400, Jagannathan Raman wrote:
>> create a context with the vfio-user library to run a PCI device
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/remote/vfio-user-obj.c | 29 +++++++++++++++++++++++++++++
>> 1 file changed, 29 insertions(+)
>> 
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 4a1e297..99d3dd1 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -27,11 +27,17 @@
>> #include "qemu/osdep.h"
>> #include "qemu-common.h"
>> 
>> +#include <errno.h>
> 
> qemu/osdep.h already includes <errno.h>
> 
>> +
>> #include "qom/object.h"
>> #include "qom/object_interfaces.h"
>> #include "qemu/error-report.h"
>> #include "trace.h"
>> #include "sysemu/runstate.h"
>> +#include "qemu/notify.h"
>> +#include "qapi/error.h"
>> +#include "sysemu/sysemu.h"
>> +#include "libvfio-user.h"
>> 
>> #define TYPE_VFU_OBJECT "vfio-user"
>> OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
>> @@ -51,6 +57,10 @@ struct VfuObject {
>> 
>>     char *socket;
>>     char *devid;
>> +
>> +    Notifier machine_done;
>> +
>> +    vfu_ctx_t *vfu_ctx;
>> };
>> 
>> static void vfu_object_set_socket(Object *obj, const char *str, Error **errp)
>> @@ -75,9 +85,23 @@ static void vfu_object_set_devid(Object *obj, const char *str, Error **errp)
>>     trace_vfu_prop("devid", str);
>> }
>> 
>> +static void vfu_object_machine_done(Notifier *notifier, void *data)
> 
> Please document the reason for using a machine init done notifier.

OK, will do.

> 
>> +{
>> +    VfuObject *o = container_of(notifier, VfuObject, machine_done);
>> +
>> +    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket, 0,
>> +                                o, VFU_DEV_TYPE_PCI);
>> +    if (o->vfu_ctx == NULL) {
>> +        error_setg(&error_abort, "vfu: Failed to create context - %s",
>> +                   strerror(errno));
>> +        return;
>> +    }
>> +}
>> +
>> static void vfu_object_init(Object *obj)
>> {
>>     VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
>> +    VfuObject *o = VFU_OBJECT(obj);
>> 
>>     if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
>>         error_report("vfu: %s only compatible with %s machine",
>> @@ -92,6 +116,9 @@ static void vfu_object_init(Object *obj)
>>     }
>> 
>>     k->nr_devs++;
>> +
>> +    o->machine_done.notify = vfu_object_machine_done;
>> +    qemu_add_machine_init_done_notifier(&o->machine_done);
>> }
>> 
>> static void vfu_object_finalize(Object *obj)
>> @@ -101,6 +128,8 @@ static void vfu_object_finalize(Object *obj)
>> 
>>     k->nr_devs--;
>> 
>> +    vfu_destroy_ctx(o->vfu_ctx);
> 
> Will this function ever be called before vfu_object_machine_done() is
> called? In that case vfu_ctx isn't initialized.

There are some case where vfu_object_finalize() could be called before
vfu_object_machine_done() executes. In that case o->vfu_ctx would be
NULL - we didn’t account for that before.

vfu_destroy_ctx() does check for NULL - however, we’ll add a check
here as well in case vfu_destroy_ctx() changes in the future.

--
Jag


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 04/11] vfio-user: find and init PCI device
  2021-09-08 12:43     ` Stefan Hajnoczi
@ 2021-09-10 15:02       ` Jag Raman
  0 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-10 15:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 8, 2021, at 8:43 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Aug 27, 2021 at 01:53:23PM -0400, Jagannathan Raman wrote:
>> @@ -96,6 +102,28 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>>                    strerror(errno));
>>         return;
>>     }
>> +
>> +    dev = qdev_find_recursive(sysbus_get_default(), o->devid);
>> +    if (dev == NULL) {
>> +        error_setg(&error_abort, "vfu: Device %s not found", o->devid);
>> +        return;
>> +    }
>> +
>> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
>> +        error_setg(&error_abort, "vfu: %s not a PCI devices", o->devid);
>> +        return;
>> +    }
>> +
>> +    o->pci_dev = PCI_DEVICE(dev);
>> +
>> +    ret = vfu_pci_init(o->vfu_ctx, VFU_PCI_TYPE_CONVENTIONAL,
>> +                       PCI_HEADER_TYPE_NORMAL, 0);
> 
> What is needed to support PCI Express?

I think we could check if o->pci_dev supports QEMU_PCI_CAP_EXPRESS,
and based on that choose if we should use
VFU_PCI_TYPE_CONVENTIONAL or VFU_PCI_TYPE_EXPRESS.

pci_is_express() is already doing that, although it’s a private function
now. It’s a good time to export it.

--
Jag

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
  2021-08-27 18:05     ` Jag Raman
  2021-09-08 12:25     ` Stefan Hajnoczi
@ 2021-09-10 15:20     ` Philippe Mathieu-Daudé
  2021-09-10 17:08       ` Jag Raman
  2021-09-11 22:29       ` John Levon
  2 siblings, 2 replies; 108+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-09-10 15:20 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, alex.williamson, marcandre.lureau, stefanha,
	thanos.makatos, alex.bennee

On 8/27/21 7:53 PM, Jagannathan Raman wrote:
> add the libvfio-user library as a submodule. build it as a cmake
> subproject.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  configure                | 11 +++++++++++
>  meson.build              | 28 ++++++++++++++++++++++++++++
>  .gitmodules              |  3 +++
>  MAINTAINERS              |  7 +++++++
>  hw/remote/meson.build    |  2 ++
>  subprojects/libvfio-user |  1 +
>  6 files changed, 52 insertions(+)
>  create mode 160000 subprojects/libvfio-user

> diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
> new file mode 160000
> index 0000000..647c934
> --- /dev/null
> +++ b/subprojects/libvfio-user
> @@ -0,0 +1 @@
> +Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
> 

Could we point to a sha1 of a released tag instead?



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-09-08 12:25     ` Stefan Hajnoczi
@ 2021-09-10 15:21       ` Philippe Mathieu-Daudé
  2021-09-13 12:15         ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-09-10 15:21 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, qemu-devel, alex.williamson, marcandre.lureau,
	thanos.makatos, alex.bennee

On 9/8/21 2:25 PM, Stefan Hajnoczi wrote:
> On Fri, Aug 27, 2021 at 01:53:20PM -0400, Jagannathan Raman wrote:

>> diff --git a/.gitmodules b/.gitmodules
>> index 08b1b48..cfeea7c 100644
>> --- a/.gitmodules
>> +++ b/.gitmodules
>> @@ -64,3 +64,6 @@
>>  [submodule "roms/vbootrom"]
>>  	path = roms/vbootrom
>>  	url = https://gitlab.com/qemu-project/vbootrom.git
>> +[submodule "subprojects/libvfio-user"]
>> +	path = subprojects/libvfio-user
>> +	url = https://github.com/nutanix/libvfio-user.git
> 
> Once this is merged I'll set up a
> gitlab.com/qemu-project/libvfio-user.git mirror. This ensures that no
> matter what happens with upstream libvfio-user.git, the source code that
> QEMU builds against will remain archived/available.

Can we do it the other way around? When the series is OK to be merged,
setup the https://gitlab.com/qemu-project/libvfio-user.git mirror and
have the submodule point to it?



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses
  2021-09-09  7:27     ` Stefan Hajnoczi
@ 2021-09-10 16:22       ` Jag Raman
  2021-09-13 12:13         ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: Jag Raman @ 2021-09-10 16:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 9, 2021, at 3:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Aug 27, 2021 at 01:53:25PM -0400, Jagannathan Raman wrote:
>> +static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>> +                                     size_t count, loff_t offset,
>> +                                     const bool is_write)
>> +{
>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>> +    uint32_t pci_access_width = sizeof(uint32_t);
>> +    size_t bytes = count;
>> +    uint32_t val = 0;
>> +    char *ptr = buf;
>> +    int len;
>> +
>> +    while (bytes > 0) {
>> +        len = (bytes > pci_access_width) ? pci_access_width : bytes;
>> +        if (is_write) {
>> +            memcpy(&val, ptr, len);
>> +            pci_default_write_config(PCI_DEVICE(o->pci_dev),
>> +                                     offset, val, len);
>> +            trace_vfu_cfg_write(offset, val);
>> +        } else {
>> +            val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
>> +                                          offset, len);
>> +            memcpy(ptr, &val, len);
> 
> pci_default_read_config() returns a host-endian 32-bit value. This code
> looks wrong because it copies different bytes on big- and little-endian
> hosts.

I’ll collect more details on this one, trying to wrap my head around it.

Concerning config space writes, it doesn’t look like we need to
perform any conversion as the store operation automatically happens
in the CPU’s native format when we do something like the following:
PCIDevice->config[addr] = val;

Concerning config read, I observed that pci_default_read_config()
performs le32_to_cpu() conversion. May be we should not rely on
it doing the conversion.

> 
>> +            trace_vfu_cfg_read(offset, val);
>> +        }
> 
> Why call pci_default_read/write_config() directly instead of
> pci_dev->config_read/write()?

That makes sense - we should be calling pci_dev->config_read/write().

After performing pci_dev->config_read(), I’ll convert it to the CPU’s
endianness format using le32_to_cpu(). On big-endian systems,
it would re-order the bytes, and on little-endian systems it would
be a no-op.

--
Jag

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses
  2021-09-09  7:37     ` Stefan Hajnoczi
@ 2021-09-10 16:36       ` Jag Raman
  0 siblings, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-10 16:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee



> On Sep 9, 2021, at 3:37 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Aug 27, 2021 at 01:53:27PM -0400, Jagannathan Raman wrote:
>> +/**
>> + * VFU_OBJECT_BAR_HANDLER - macro for defining handlers for PCI BARs.
>> + *
>> + * To create handler for BAR number 2, VFU_OBJECT_BAR_HANDLER(2) would
>> + * define vfu_object_bar2_handler
>> + */
>> +#define VFU_OBJECT_BAR_HANDLER(BAR_NO)                                         \
>> +    static ssize_t vfu_object_bar##BAR_NO##_handler(vfu_ctx_t *vfu_ctx,        \
>> +                                        char * const buf, size_t count,        \
>> +                                        loff_t offset, const bool is_write)    \
>> +    {                                                                          \
>> +        VfuObject *o = vfu_get_private(vfu_ctx);                               \
>> +        hwaddr addr = (hwaddr)(pci_get_long(o->pci_dev->config +               \
>> +                                            PCI_BASE_ADDRESS_0 +               \
>> +                                            (4 * BAR_NO)) + offset);           \
> 
> Does this handle 64-bit BARs?

It presently only handles 32-bit BARs. We’ll add support for 64-bit BARs in the next rev
of this series.

> 
>> +/**
>> + * vfu_object_register_bars - Identify active BAR regions of pdev and setup
>> + *                            callbacks to handle read/write accesses
>> + */
>> +static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>> +{
>> +    uint32_t orig_val, new_val;
>> +    int i, size;
>> +
>> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
>> +        orig_val = pci_default_read_config(pdev,
>> +                                           PCI_BASE_ADDRESS_0 + (4 * i), 4);
> 
> Same question as in an earlier patch: should we call pdev->read_config()?

Sure, will do.

--
Jag


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-09-10 15:20     ` Philippe Mathieu-Daudé
@ 2021-09-10 17:08       ` Jag Raman
  2021-09-11 22:29       ` John Levon
  1 sibling, 0 replies; 108+ messages in thread
From: Jag Raman @ 2021-09-10 17:08 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	qemu-devel, Alex Williamson, marcandre.lureau, Stefan Hajnoczi,
	thanos.makatos, alex.bennee



> On Sep 10, 2021, at 11:20 AM, Philippe Mathieu-Daudé <philmd@redhat.com> wrote:
> 
> On 8/27/21 7:53 PM, Jagannathan Raman wrote:
>> add the libvfio-user library as a submodule. build it as a cmake
>> subproject.
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> configure                | 11 +++++++++++
>> meson.build              | 28 ++++++++++++++++++++++++++++
>> .gitmodules              |  3 +++
>> MAINTAINERS              |  7 +++++++
>> hw/remote/meson.build    |  2 ++
>> subprojects/libvfio-user |  1 +
>> 6 files changed, 52 insertions(+)
>> create mode 160000 subprojects/libvfio-user
> 
>> diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
>> new file mode 160000
>> index 0000000..647c934
>> --- /dev/null
>> +++ b/subprojects/libvfio-user
>> @@ -0,0 +1 @@
>> +Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
>> 
> 
> Could we point to a sha1 of a released tag instead?

OK, will do.

--
Jag

> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-09-10 15:20     ` Philippe Mathieu-Daudé
  2021-09-10 17:08       ` Jag Raman
@ 2021-09-11 22:29       ` John Levon
  2021-09-13 10:19         ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 108+ messages in thread
From: John Levon @ 2021-09-11 22:29 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman,
	Swapnil Ingle, qemu-devel, alex.williamson, marcandre.lureau,
	stefanha, Thanos Makatos, alex.bennee

On Fri, Sep 10, 2021 at 05:20:09PM +0200, Philippe Mathieu-Daudé wrote:

> On 8/27/21 7:53 PM, Jagannathan Raman wrote:
> > add the libvfio-user library as a submodule. build it as a cmake
> > subproject.
> > 
> > Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> > Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> > Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> > ---
> >  configure                | 11 +++++++++++
> >  meson.build              | 28 ++++++++++++++++++++++++++++
> >  .gitmodules              |  3 +++
> >  MAINTAINERS              |  7 +++++++
> >  hw/remote/meson.build    |  2 ++
> >  subprojects/libvfio-user |  1 +
> >  6 files changed, 52 insertions(+)
> >  create mode 160000 subprojects/libvfio-user
> 
> > diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
> > new file mode 160000
> > index 0000000..647c934
> > --- /dev/null
> > +++ b/subprojects/libvfio-user
> > @@ -0,0 +1 @@
> > +Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
> 
> Could we point to a sha1 of a released tag instead?

We don't have releases (yet) partly because we haven't yet stabilized the API.

regards
john

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-09-11 22:29       ` John Levon
@ 2021-09-13 10:19         ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 108+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-09-13 10:19 UTC (permalink / raw)
  To: John Levon, thuth, stefanha
  Cc: elena.ufimtseva, john.g.johnson, Jagannathan Raman,
	Swapnil Ingle, qemu-devel, alex.williamson, marcandre.lureau,
	Thanos Makatos, alex.bennee

On 9/12/21 12:29 AM, John Levon wrote:
> On Fri, Sep 10, 2021 at 05:20:09PM +0200, Philippe Mathieu-Daudé wrote:
>> On 8/27/21 7:53 PM, Jagannathan Raman wrote:
>>> add the libvfio-user library as a submodule. build it as a cmake
>>> subproject.
>>>
>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>> ---
>>>  configure                | 11 +++++++++++
>>>  meson.build              | 28 ++++++++++++++++++++++++++++
>>>  .gitmodules              |  3 +++
>>>  MAINTAINERS              |  7 +++++++
>>>  hw/remote/meson.build    |  2 ++
>>>  subprojects/libvfio-user |  1 +
>>>  6 files changed, 52 insertions(+)
>>>  create mode 160000 subprojects/libvfio-user
>>
>>> diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
>>> new file mode 160000
>>> index 0000000..647c934
>>> --- /dev/null
>>> +++ b/subprojects/libvfio-user
>>> @@ -0,0 +1 @@
>>> +Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
>>
>> Could we point to a sha1 of a released tag instead?
> 
> We don't have releases (yet) partly because we haven't yet stabilized the API.

OK. Maybe acceptable, up to the maintainer then ¯\_(ツ)_/¯



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses
  2021-09-10 16:22       ` Jag Raman
@ 2021-09-13 12:13         ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-13 12:13 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	philmd, qemu-devel, Alex Williamson, Marc-André Lureau,
	thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 2215 bytes --]

On Fri, Sep 10, 2021 at 04:22:56PM +0000, Jag Raman wrote:
> 
> 
> > On Sep 9, 2021, at 3:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Fri, Aug 27, 2021 at 01:53:25PM -0400, Jagannathan Raman wrote:
> >> +static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> >> +                                     size_t count, loff_t offset,
> >> +                                     const bool is_write)
> >> +{
> >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >> +    uint32_t pci_access_width = sizeof(uint32_t);
> >> +    size_t bytes = count;
> >> +    uint32_t val = 0;
> >> +    char *ptr = buf;
> >> +    int len;
> >> +
> >> +    while (bytes > 0) {
> >> +        len = (bytes > pci_access_width) ? pci_access_width : bytes;
> >> +        if (is_write) {
> >> +            memcpy(&val, ptr, len);
> >> +            pci_default_write_config(PCI_DEVICE(o->pci_dev),
> >> +                                     offset, val, len);
> >> +            trace_vfu_cfg_write(offset, val);
> >> +        } else {
> >> +            val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
> >> +                                          offset, len);
> >> +            memcpy(ptr, &val, len);
> > 
> > pci_default_read_config() returns a host-endian 32-bit value. This code
> > looks wrong because it copies different bytes on big- and little-endian
> > hosts.
> 
> I’ll collect more details on this one, trying to wrap my head around it.
> 
> Concerning config space writes, it doesn’t look like we need to
> perform any conversion as the store operation automatically happens
> in the CPU’s native format when we do something like the following:
> PCIDevice->config[addr] = val;

PCIDevice->config is uint8_t*. Endianness isn't an issue with single
byte accesses.

> 
> Concerning config read, I observed that pci_default_read_config()
> performs le32_to_cpu() conversion. May be we should not rely on
> it doing the conversion.

Yes, ->config_read() returns a host-endian 32-bit value and
->config_write() receives a host-endian 32-bit value (it has a
bit-shifting loop that implicitly handles endianness conversion).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC server v2 01/11] vfio-user: build library
  2021-09-10 15:21       ` Philippe Mathieu-Daudé
@ 2021-09-13 12:15         ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-13 12:15 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman,
	swapnil.ingle, john.levon, qemu-devel, alex.williamson,
	marcandre.lureau, thanos.makatos, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1084 bytes --]

On Fri, Sep 10, 2021 at 05:21:33PM +0200, Philippe Mathieu-Daudé wrote:
> On 9/8/21 2:25 PM, Stefan Hajnoczi wrote:
> > On Fri, Aug 27, 2021 at 01:53:20PM -0400, Jagannathan Raman wrote:
> 
> >> diff --git a/.gitmodules b/.gitmodules
> >> index 08b1b48..cfeea7c 100644
> >> --- a/.gitmodules
> >> +++ b/.gitmodules
> >> @@ -64,3 +64,6 @@
> >>  [submodule "roms/vbootrom"]
> >>  	path = roms/vbootrom
> >>  	url = https://gitlab.com/qemu-project/vbootrom.git
> >> +[submodule "subprojects/libvfio-user"]
> >> +	path = subprojects/libvfio-user
> >> +	url = https://github.com/nutanix/libvfio-user.git
> > 
> > Once this is merged I'll set up a
> > gitlab.com/qemu-project/libvfio-user.git mirror. This ensures that no
> > matter what happens with upstream libvfio-user.git, the source code that
> > QEMU builds against will remain archived/available.
> 
> Can we do it the other way around? When the series is OK to be merged,
> setup the https://gitlab.com/qemu-project/libvfio-user.git mirror and
> have the submodule point to it?

Yes, good idea.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-10  5:25             ` John Johnson
@ 2021-09-13 12:35               ` Stefan Hajnoczi
  2021-09-13 17:23               ` John Johnson
  1 sibling, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-13 12:35 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 3146 bytes --]

On Fri, Sep 10, 2021 at 05:25:13AM +0000, John Johnson wrote:
> 
> 
> > On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> >> 
> >> 
> >> 	I did look at coroutines, but they seemed to work when the sender
> >> is triggering the coroutine on send, not when request packets are arriving
> >> asynchronously to the sends.
> > 
> > This can be done with a receiver coroutine. Its job is to be the only
> > thing that reads vfio-user messages from the socket. A receiver
> > coroutine reads messages from the socket and wakes up the waiting
> > coroutine that yielded from vfio_user_send_recv() or
> > vfio_user_pci_process_req().
> > 
> > (Although vfio_user_pci_process_req() could be called directly from the
> > receiver coroutine, it seems safer to have a separate coroutine that
> > processes requests so that the receiver isn't blocked in case
> > vfio_user_pci_process_req() yields while processing a request.)
> > 
> > Going back to what you mentioned above, the receiver coroutine does
> > something like this:
> > 
> >  if it's a reply
> >      reply = find_reply(...)
> >      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >  else
> >      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >      if (pending_reqs_was_empty) {
> >          qemu_coroutine_enter(process_request_co);
> >      }
> > 
> > The pending_reqs queue holds incoming requests that the
> > process_request_co coroutine processes.
> > 
> 
> 
> 	How do coroutines work across threads?  There can be multiple vCPU
> threads waiting for replies, and I think the receiver coroutine will be
> running in the main loop thread.  Where would a vCPU block waiting for
> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller.

A vCPU thread holding the BQL can iterate the event loop if it has
reached a synchronous point that needs to wait for a reply before
returning. I think we have this situation when a MemoryRegion is
accessed on the proxy device.

For example, block/block-backend.c:blk_prw() kicks off a coroutine and
then runs the event loop until the coroutine finishes:

  Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
  bdrv_coroutine_enter(blk_bs(blk), co);
  BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);

BDRV_POLL_WHILE() boils down to a loop like this:

  while ((cond)) {
    aio_poll(ctx, true);
  }

I also want to check that I understand the scenarios in which the
vfio-user communication code is used:

1. vhost-user-server

The vfio-user communication code should run in a given AioContext (it
will be the main loop by default but maybe the user will be able to
configure a specific IOThread in the future).

2. vCPU thread vfio-user clients

The vfio-user communication code is called from the vCPU thread where
the proxy device executes. The MemoryRegion->read()/write() callbacks
are synchronous, so the thread needs to wait for a vfio-user reply
before it can return.

Is this what you had in mind?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-10  5:25             ` John Johnson
  2021-09-13 12:35               ` Stefan Hajnoczi
@ 2021-09-13 17:23               ` John Johnson
  2021-09-14 13:06                 ` Stefan Hajnoczi
  1 sibling, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-13 17:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos


> 
>> On Sep 9, 2021, at 10:25 PM, John Johnson <john.g.johnson@oracle.com> wrote:
>> 
>> 
>> 
>>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
>>>> 
>>>> 
>>>> 	I did look at coroutines, but they seemed to work when the sender
>>>> is triggering the coroutine on send, not when request packets are arriving
>>>> asynchronously to the sends.
>>> 
>>> This can be done with a receiver coroutine. Its job is to be the only
>>> thing that reads vfio-user messages from the socket. A receiver
>>> coroutine reads messages from the socket and wakes up the waiting
>>> coroutine that yielded from vfio_user_send_recv() or
>>> vfio_user_pci_process_req().
>>> 
>>> (Although vfio_user_pci_process_req() could be called directly from the
>>> receiver coroutine, it seems safer to have a separate coroutine that
>>> processes requests so that the receiver isn't blocked in case
>>> vfio_user_pci_process_req() yields while processing a request.)
>>> 
>>> Going back to what you mentioned above, the receiver coroutine does
>>> something like this:
>>> 
>>> if it's a reply
>>>     reply = find_reply(...)
>>>     qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
>>> else
>>>     QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
>>>     if (pending_reqs_was_empty) {
>>>         qemu_coroutine_enter(process_request_co);
>>>     }
>>> 
>>> The pending_reqs queue holds incoming requests that the
>>> process_request_co coroutine processes.
>>> 
>> 
>> 
>> 	How do coroutines work across threads?  There can be multiple vCPU
>> threads waiting for replies, and I think the receiver coroutine will be
>> running in the main loop thread.  Where would a vCPU block waiting for
>> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller
> 
> 
> 
> A vCPU thread holding the BQL can iterate the event loop if it has
> reached a synchronous point that needs to wait for a reply before
> returning. I think we have this situation when a MemoryRegion is
> accessed on the proxy device.
> 
> For example, block/block-backend.c:blk_prw() kicks off a coroutine and
> then runs the event loop until the coroutine finishes:
> 
>   Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
>   bdrv_coroutine_enter(blk_bs(blk), co);
>   BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> 
> BDRV_POLL_WHILE() boils down to a loop like this:
> 
>   while ((cond)) {
>     aio_poll(ctx, true);
>   }
> 

	I think that would make vCPUs sending requests and the
receiver coroutine all poll on the same socket.  If the “wrong”
routine reads the message, I’d need a second level of synchronization
to pass the message to the “right” one.  e.g., if the vCPU coroutine
reads a request, it needs to pass it to the receiver; if the receiver
coroutine reads a reply, it needs to pass it to a vCPU.

	Avoiding this complexity is one of the reasons I went with
a separate thread that only reads the socket over the mp-qemu model,
which does have the sender poll, but doesn’t need to handle incoming
requests.



> I also want to check that I understand the scenarios in which the
> vfio-user communication code is used:
> 
> 1. vhost-user-server
> 
> The vfio-user communication code should run in a given AioContext (it
> will be the main loop by default but maybe the user will be able to
> configure a specific IOThread in the future).
> 

	Jag would know more, but I believe it runs off the main loop.
Running it in an iothread doesn’t gain much, since it needs BQL to
run the device emulation code.


> 2. vCPU thread vfio-user clients
> 
> The vfio-user communication code is called from the vCPU thread where
> the proxy device executes. The MemoryRegion->read()/write() callbacks
> are synchronous, so the thread needs to wait for a vfio-user reply
> before it can return.
> 
> Is this what you had in mind?

	The client is also called from the main thread - the GET_*
messages from vfio_user_pci_realize() as well as MAP/DEMAP messages
from guest address space change transactions.  It is also called by
the migration thread, which is a separate thread that does not run
holding BQL.

							JJ


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-13 17:23               ` John Johnson
@ 2021-09-14 13:06                 ` Stefan Hajnoczi
  2021-09-15  0:21                   ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-14 13:06 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 5579 bytes --]

On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
> >> On Sep 9, 2021, at 10:25 PM, John Johnson <john.g.johnson@oracle.com> wrote:
> >>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> >>>> 	I did look at coroutines, but they seemed to work when the sender
> >>>> is triggering the coroutine on send, not when request packets are arriving
> >>>> asynchronously to the sends.
> >>> 
> >>> This can be done with a receiver coroutine. Its job is to be the only
> >>> thing that reads vfio-user messages from the socket. A receiver
> >>> coroutine reads messages from the socket and wakes up the waiting
> >>> coroutine that yielded from vfio_user_send_recv() or
> >>> vfio_user_pci_process_req().
> >>> 
> >>> (Although vfio_user_pci_process_req() could be called directly from the
> >>> receiver coroutine, it seems safer to have a separate coroutine that
> >>> processes requests so that the receiver isn't blocked in case
> >>> vfio_user_pci_process_req() yields while processing a request.)
> >>> 
> >>> Going back to what you mentioned above, the receiver coroutine does
> >>> something like this:
> >>> 
> >>> if it's a reply
> >>>     reply = find_reply(...)
> >>>     qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >>> else
> >>>     QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >>>     if (pending_reqs_was_empty) {
> >>>         qemu_coroutine_enter(process_request_co);
> >>>     }
> >>> 
> >>> The pending_reqs queue holds incoming requests that the
> >>> process_request_co coroutine processes.
> >>> 
> >> 
> >> 
> >> 	How do coroutines work across threads?  There can be multiple vCPU
> >> threads waiting for replies, and I think the receiver coroutine will be
> >> running in the main loop thread.  Where would a vCPU block waiting for
> >> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller
> > 
> > 
> > 
> > A vCPU thread holding the BQL can iterate the event loop if it has
> > reached a synchronous point that needs to wait for a reply before
> > returning. I think we have this situation when a MemoryRegion is
> > accessed on the proxy device.
> > 
> > For example, block/block-backend.c:blk_prw() kicks off a coroutine and
> > then runs the event loop until the coroutine finishes:
> > 
> >   Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
> >   bdrv_coroutine_enter(blk_bs(blk), co);
> >   BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> > 
> > BDRV_POLL_WHILE() boils down to a loop like this:
> > 
> >   while ((cond)) {
> >     aio_poll(ctx, true);
> >   }
> > 
> 
> 	I think that would make vCPUs sending requests and the
> receiver coroutine all poll on the same socket.  If the “wrong”
> routine reads the message, I’d need a second level of synchronization
> to pass the message to the “right” one.  e.g., if the vCPU coroutine
> reads a request, it needs to pass it to the receiver; if the receiver
> coroutine reads a reply, it needs to pass it to a vCPU.
> 
> 	Avoiding this complexity is one of the reasons I went with
> a separate thread that only reads the socket over the mp-qemu model,
> which does have the sender poll, but doesn’t need to handle incoming
> requests.

Only one coroutine reads from the socket, the "receiver" coroutine. In a
previous reply I sketched what the receiver does:

  if it's a reply
      reply = find_reply(...)
      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
  else
      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
      if (pending_reqs_was_empty) {
          qemu_coroutine_enter(process_request_co);
      }

The qemu_coroutine_enter(reply->co) call re-enters the coroutine that
was created by the vCPU thread. Is this the "second level of
synchronization" that you described? It's very similar to signalling
reply->cv in the existing patch.

Now I'm actually thinking about whether this can be improved by keeping
the condvar so that the vCPU thread doesn't need to call aio_poll()
(which is awkward because it doesn't drop the BQL and therefore blocks
other vCPUs from making progress). That approach wouldn't require a
dedicated thread for vfio-user.

> > I also want to check that I understand the scenarios in which the
> > vfio-user communication code is used:
> > 
> > 1. vhost-user-server
> > 
> > The vfio-user communication code should run in a given AioContext (it
> > will be the main loop by default but maybe the user will be able to
> > configure a specific IOThread in the future).
> > 
> 
> 	Jag would know more, but I believe it runs off the main loop.
> Running it in an iothread doesn’t gain much, since it needs BQL to
> run the device emulation code.
> 
> 
> > 2. vCPU thread vfio-user clients
> > 
> > The vfio-user communication code is called from the vCPU thread where
> > the proxy device executes. The MemoryRegion->read()/write() callbacks
> > are synchronous, so the thread needs to wait for a vfio-user reply
> > before it can return.
> > 
> > Is this what you had in mind?
> 
> 	The client is also called from the main thread - the GET_*
> messages from vfio_user_pci_realize() as well as MAP/DEMAP messages
> from guest address space change transactions.  It is also called by
> the migration thread, which is a separate thread that does not run
> holding BQL.

Thanks for mentioning those additional cases.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-14 13:06                 ` Stefan Hajnoczi
@ 2021-09-15  0:21                   ` John Johnson
  2021-09-15 13:04                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-15  0:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Sep 14, 2021, at 6:06 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
>>>> On Sep 9, 2021, at 10:25 PM, John Johnson <john.g.johnson@oracle.com> wrote:
>>>>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
>>>>>> 	I did look at coroutines, but they seemed to work when the sender
>>>>>> is triggering the coroutine on send, not when request packets are arriving
>>>>>> asynchronously to the sends.
>>>>> 
>>>>> This can be done with a receiver coroutine. Its job is to be the only
>>>>> thing that reads vfio-user messages from the socket. A receiver
>>>>> coroutine reads messages from the socket and wakes up the waiting
>>>>> coroutine that yielded from vfio_user_send_recv() or
>>>>> vfio_user_pci_process_req().
>>>>> 
>>>>> (Although vfio_user_pci_process_req() could be called directly from the
>>>>> receiver coroutine, it seems safer to have a separate coroutine that
>>>>> processes requests so that the receiver isn't blocked in case
>>>>> vfio_user_pci_process_req() yields while processing a request.)
>>>>> 
>>>>> Going back to what you mentioned above, the receiver coroutine does
>>>>> something like this:
>>>>> 
>>>>> if it's a reply
>>>>>    reply = find_reply(...)
>>>>>    qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
>>>>> else
>>>>>    QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
>>>>>    if (pending_reqs_was_empty) {
>>>>>        qemu_coroutine_enter(process_request_co);
>>>>>    }
>>>>> 
>>>>> The pending_reqs queue holds incoming requests that the
>>>>> process_request_co coroutine processes.
>>>>> 
>>>> 
>>>> 
>>>> 	How do coroutines work across threads?  There can be multiple vCPU
>>>> threads waiting for replies, and I think the receiver coroutine will be
>>>> running in the main loop thread.  Where would a vCPU block waiting for
>>>> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller
>>> 
>>> 
>>> 
>>> A vCPU thread holding the BQL can iterate the event loop if it has
>>> reached a synchronous point that needs to wait for a reply before
>>> returning. I think we have this situation when a MemoryRegion is
>>> accessed on the proxy device.
>>> 
>>> For example, block/block-backend.c:blk_prw() kicks off a coroutine and
>>> then runs the event loop until the coroutine finishes:
>>> 
>>>  Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
>>>  bdrv_coroutine_enter(blk_bs(blk), co);
>>>  BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
>>> 
>>> BDRV_POLL_WHILE() boils down to a loop like this:
>>> 
>>>  while ((cond)) {
>>>    aio_poll(ctx, true);
>>>  }
>>> 
>> 
>> 	I think that would make vCPUs sending requests and the
>> receiver coroutine all poll on the same socket.  If the “wrong”
>> routine reads the message, I’d need a second level of synchronization
>> to pass the message to the “right” one.  e.g., if the vCPU coroutine
>> reads a request, it needs to pass it to the receiver; if the receiver
>> coroutine reads a reply, it needs to pass it to a vCPU.
>> 
>> 	Avoiding this complexity is one of the reasons I went with
>> a separate thread that only reads the socket over the mp-qemu model,
>> which does have the sender poll, but doesn’t need to handle incoming
>> requests.
> 
> Only one coroutine reads from the socket, the "receiver" coroutine. In a
> previous reply I sketched what the receiver does:
> 
>  if it's a reply
>      reply = find_reply(...)
>      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
>  else
>      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
>      if (pending_reqs_was_empty) {
>          qemu_coroutine_enter(process_request_co);
>      }
> 

	Sorry, I was assuming when you said the coroutine will block with
aio_poll(), you implied it would also read messages from the socket.
 

> The qemu_coroutine_enter(reply->co) call re-enters the coroutine that
> was created by the vCPU thread. Is this the "second level of
> synchronization" that you described? It's very similar to signalling
> reply->cv in the existing patch.
> 

	Yes, the only difference is it would be woken on each message,
even though it doesn’t read them.  Which is what I think you’re addressing
below.


> Now I'm actually thinking about whether this can be improved by keeping
> the condvar so that the vCPU thread doesn't need to call aio_poll()
> (which is awkward because it doesn't drop the BQL and therefore blocks
> other vCPUs from making progress). That approach wouldn't require a
> dedicated thread for vfio-user.
> 

	Wouldn’t you need to acquire BQL twice for every vCPU reply: once to
run the receiver coroutine, and once when the vCPU thread wakes up and wants
to return to the VFIO code.  The migration thread would also add a BQL
dependency, where it didn’t have one before.

	Is your objection with using an iothread, or using a separate thread?
I can change to using qemu_thread_create() and running aio_poll() from the
thread routine, instead of creating an iothread.


	On a related subject:

On Aug 24, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:

>> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
>> +                                 &local_err);
> 
> This is a blocking call. My understanding is that the IOThread is shared
> by all vfio-user devices, so other devices will have to wait if one of
> them is acting up (e.g. the device emulation process sent less than
> sizeof(msg) bytes).


	This shouldn’t block if the emulation process sends less than sizeof(msg)
bytes.  qio_channel_readv() will eventually call recvmsg(), which only blocks a
short read if MSG_WAITALL is set, and it’s not set in this case.  recvmsg() will
return the data available, and vfio_user_recv() will treat a short read as an error.

								JJ


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-15  0:21                   ` John Johnson
@ 2021-09-15 13:04                     ` Stefan Hajnoczi
  2021-09-15 19:14                       ` John Johnson
  0 siblings, 1 reply; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-15 13:04 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 12349 bytes --]

On Wed, Sep 15, 2021 at 12:21:10AM +0000, John Johnson wrote:
> 
> 
> > On Sep 14, 2021, at 6:06 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
> >>>> On Sep 9, 2021, at 10:25 PM, John Johnson <john.g.johnson@oracle.com> wrote:
> >>>>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> >>>>>> 	I did look at coroutines, but they seemed to work when the sender
> >>>>>> is triggering the coroutine on send, not when request packets are arriving
> >>>>>> asynchronously to the sends.
> >>>>> 
> >>>>> This can be done with a receiver coroutine. Its job is to be the only
> >>>>> thing that reads vfio-user messages from the socket. A receiver
> >>>>> coroutine reads messages from the socket and wakes up the waiting
> >>>>> coroutine that yielded from vfio_user_send_recv() or
> >>>>> vfio_user_pci_process_req().
> >>>>> 
> >>>>> (Although vfio_user_pci_process_req() could be called directly from the
> >>>>> receiver coroutine, it seems safer to have a separate coroutine that
> >>>>> processes requests so that the receiver isn't blocked in case
> >>>>> vfio_user_pci_process_req() yields while processing a request.)
> >>>>> 
> >>>>> Going back to what you mentioned above, the receiver coroutine does
> >>>>> something like this:
> >>>>> 
> >>>>> if it's a reply
> >>>>>    reply = find_reply(...)
> >>>>>    qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >>>>> else
> >>>>>    QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >>>>>    if (pending_reqs_was_empty) {
> >>>>>        qemu_coroutine_enter(process_request_co);
> >>>>>    }
> >>>>> 
> >>>>> The pending_reqs queue holds incoming requests that the
> >>>>> process_request_co coroutine processes.
> >>>>> 
> >>>> 
> >>>> 
> >>>> 	How do coroutines work across threads?  There can be multiple vCPU
> >>>> threads waiting for replies, and I think the receiver coroutine will be
> >>>> running in the main loop thread.  Where would a vCPU block waiting for
> >>>> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller
> >>> 
> >>> 
> >>> 
> >>> A vCPU thread holding the BQL can iterate the event loop if it has
> >>> reached a synchronous point that needs to wait for a reply before
> >>> returning. I think we have this situation when a MemoryRegion is
> >>> accessed on the proxy device.
> >>> 
> >>> For example, block/block-backend.c:blk_prw() kicks off a coroutine and
> >>> then runs the event loop until the coroutine finishes:
> >>> 
> >>>  Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
> >>>  bdrv_coroutine_enter(blk_bs(blk), co);
> >>>  BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> >>> 
> >>> BDRV_POLL_WHILE() boils down to a loop like this:
> >>> 
> >>>  while ((cond)) {
> >>>    aio_poll(ctx, true);
> >>>  }
> >>> 
> >> 
> >> 	I think that would make vCPUs sending requests and the
> >> receiver coroutine all poll on the same socket.  If the “wrong”
> >> routine reads the message, I’d need a second level of synchronization
> >> to pass the message to the “right” one.  e.g., if the vCPU coroutine
> >> reads a request, it needs to pass it to the receiver; if the receiver
> >> coroutine reads a reply, it needs to pass it to a vCPU.
> >> 
> >> 	Avoiding this complexity is one of the reasons I went with
> >> a separate thread that only reads the socket over the mp-qemu model,
> >> which does have the sender poll, but doesn’t need to handle incoming
> >> requests.
> > 
> > Only one coroutine reads from the socket, the "receiver" coroutine. In a
> > previous reply I sketched what the receiver does:
> > 
> >  if it's a reply
> >      reply = find_reply(...)
> >      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >  else
> >      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >      if (pending_reqs_was_empty) {
> >          qemu_coroutine_enter(process_request_co);
> >      }
> > 
> 
> 	Sorry, I was assuming when you said the coroutine will block with
> aio_poll(), you implied it would also read messages from the socket.

The vCPU thread calls aio_poll() outside the coroutine, similar to the
block/block-backend.c:blk_prw() example I posted above:

  Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
  bdrv_coroutine_enter(blk_bs(blk), co);
  BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);

(BDRV_POLL_WHILE() is a aio_poll() loop.)

The coroutine isn't aware of aio_poll(), it just yields when it needs to
wait.

> > The qemu_coroutine_enter(reply->co) call re-enters the coroutine that
> > was created by the vCPU thread. Is this the "second level of
> > synchronization" that you described? It's very similar to signalling
> > reply->cv in the existing patch.
> > 
> 
> 	Yes, the only difference is it would be woken on each message,
> even though it doesn’t read them.  Which is what I think you’re addressing
> below.
>
> > Now I'm actually thinking about whether this can be improved by keeping
> > the condvar so that the vCPU thread doesn't need to call aio_poll()
> > (which is awkward because it doesn't drop the BQL and therefore blocks
> > other vCPUs from making progress). That approach wouldn't require a
> > dedicated thread for vfio-user.
> > 
> 
> 	Wouldn’t you need to acquire BQL twice for every vCPU reply: once to
> run the receiver coroutine, and once when the vCPU thread wakes up and wants
> to return to the VFIO code.  The migration thread would also add a BQL
> dependency, where it didn’t have one before.

If aio_poll() is used then the vCPU thread doesn't drop the BQL at all.
The vCPU thread sends the message and waits for the reply while other
BQL threads are locked out.

If a condvar or similar mechanism is used then the vCPU sends the
message, drops the BQL, and waits on the condvar. The main loop thread
runs the receiver coroutine and re-enters the coroutine, which signals
the condvar. The vCPU then re-acquires the BQL.

> 	Is your objection with using an iothread, or using a separate thread?
> I can change to using qemu_thread_create() and running aio_poll() from the
> thread routine, instead of creating an iothread.

The vfio-user communication code shouldn't need to worry about threads
or locks. The code can be written in terms of AioContext so the caller
can use it from various environments without hardcoding details of the
BQL or threads into the communication code. This makes it easier to
understand and less tightly coupled.

I'll try to sketch how that could work:

The main concept is VFIOProxy, which has a QIOChannel (the socket
connection) and its main API is:

  void coroutine_fn vfio_user_co_send_recv(VFIOProxy *proxy,
          VFIOUserHdr *msg, VFIOUserFDs *fds, int rsize, int flags);

There is also a request callback for processing incoming requests:

  void coroutine_fn (*request)(void *opaque, char *buf,
                              VFIOUserFDs *fds);

The main loop thread can either use vfio_user_co_send_recv() from
coroutine context or use this blocking wrapper:

  typedef struct {
      VFIOProxy *proxy;
      VFIOUserHdr *msg;
      VFIOUserFDs *fds;
      int rsize;
      int flags;
      bool done;
  } VFIOUserSendRecvData;

  static void coroutine_fn vfu_send_recv_co(void *opaque)
  {
      VFIOUserSendRecvData *data = opaque;
      vfio_user_co_send_recv(data->proxy, data->msg, data->fds,
                             data->rsize, data->flags);
      data->done = true;
  }

  /* A blocking version of vfio_user_co_send_recv() */
  void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
                           VFIOUserFDs *fds, int rsize, int flags)
  {
      VFIOUserSendRecvData data = {
          .proxy = proxy,
	  .msg = msg,
	  .fds = fds,
	  .rsize = rsize,
	  .flags = flags,
      };
      Coroutine *co = qemu_coroutine_create(vfu_send_recv_co, &data);
      qemu_coroutine_enter(co);
      while (!data.done) {
          aio_poll(proxy->ioc->ctx, true);
      }
  }

The vCPU thread can use vfio_user_send_recv() if it wants, although the
BQL will be held, preventing other threads from making progress. That
can be avoided by writing a similar wrapper that uses a QemuSemaphore:

  typedef struct {
      VFIOProxy *proxy;
      VFIOUserHdr *msg;
      VFIOUserFDs *fds;
      int rsize;
      int flags;
      QemuSemaphore sem;
  } VFIOUserSendRecvData;

  static void coroutine_fn vfu_send_recv_co(void *opaque)
  {
      VFIOUserSendRecvData *data = opaque;
      vfio_user_co_send_recv(data->proxy, data->msg, data->fds,
                             data->rsize, data->flags);
      qemu_sem_post(&data->sem);
  }

  /*
   * A blocking version of vfio_user_co_send_recv() that relies on
   * another thread to run the event loop. This can be used from vCPU
   * threads to avoid hogging the BQL.
   */
  void vfio_user_vcpu_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
                                VFIOUserFDs *fds, int rsize, int flags)
  {
      VFIOUserSendRecvData data = {
          .proxy = proxy,
	  .msg = msg,
	  .fds = fds,
	  .rsize = rsize,
	  .flags = flags,
      };
      Coroutine *co = qemu_coroutine_create(vfu_vcpu_send_recv_co, &data);

      qemu_sem_init(&data.sem, 0);

      qemu_coroutine_enter(co);

      qemu_mutex_unlock_iothread();
      qemu_sem_wait(&data.sem);
      qemu_mutex_lock_iothread();

      qemu_sem_destroy(&data.sem);
  }

With vfio_user_vcpu_send_recv() the vCPU thread doesn't call aio_poll()
itself but instead relies on the main loop thread to run the event loop.

By writing coroutines that run in proxy->ioc->ctx we keep the threading
model and locking in the caller. The communication code isn't aware of
or tied to specific threads. It's possible to drop proxy->lock because
state is only changed from within the AioContext, not multiple threads
that may run in parallel. I think this makes the communication code
simpler and cleaner.

It's possible to use IOThreads with this approach: set the QIOChannel's
AioContext to the IOThread AioContext. However, I don't think we can do
this in the vhost-user server yet because QEMU's device models expect to
run with the BQL and not in an IOThread.

I didn't go into detail about how vfio_user_co_send_recv() is
implemented. Please let me know if you want me to share ideas about
that, but it's what we've already discussed with a "receiver" coroutine
that re-enters the reply coroutines or calls ->request(). A CoMutex is
needed to around qio_channel_write_all() to ensure that coroutines
sending messages don't interleave partial writes if the socket sndbuf is
exhausted.

> 	On a related subject:
> 
> On Aug 24, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> >> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
> >> +                                 &local_err);
> > 
> > This is a blocking call. My understanding is that the IOThread is shared
> > by all vfio-user devices, so other devices will have to wait if one of
> > them is acting up (e.g. the device emulation process sent less than
> > sizeof(msg) bytes).
> 
> 
> 	This shouldn’t block if the emulation process sends less than sizeof(msg)
> bytes.  qio_channel_readv() will eventually call recvmsg(), which only blocks a
> short read if MSG_WAITALL is set, and it’s not set in this case.  recvmsg() will
> return the data available, and vfio_user_recv() will treat a short read as an error.

That's true but vfio_user_recv() can still block layer on: if only
sizeof(msg) bytes are available and msg.size > sizeof(msg) then the
second call blocks.

  msgleft = msg.size - sizeof(msg);
  if (msgleft != 0) {
      ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);

I think either code should be non-blocking or it shouldn't be. Writing
code that is partially non-blocking is asking for trouble because it's
not obvious where it can block and misbehaving or malicious programs can
cause it to block.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-15 13:04                     ` Stefan Hajnoczi
@ 2021-09-15 19:14                       ` John Johnson
  2021-09-16 11:49                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 108+ messages in thread
From: John Johnson @ 2021-09-15 19:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos



> On Sep 15, 2021, at 6:04 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Sep 15, 2021 at 12:21:10AM +0000, John Johnson wrote:
>> 
>> 
>>> On Sep 14, 2021, at 6:06 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
>>>>>> On Sep 9, 2021, at 10:25 PM, John Johnson <john.g.johnson@oracle.com> wrote:
>>>>>>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
>>>>>>>> 	I did look at coroutines, but they seemed to work when the sender
>>>>>>>> is triggering the coroutine on send, not when request packets are arriving
>>>>>>>> asynchronously to the sends.
>>>>>>> 
>>>>>>> This can be done with a receiver coroutine. Its job is to be the only
>>>>>>> thing that reads vfio-user messages from the socket. A receiver
>>>>>>> coroutine reads messages from the socket and wakes up the waiting
>>>>>>> coroutine that yielded from vfio_user_send_recv() or
>>>>>>> vfio_user_pci_process_req().
>>>>>>> 
>>>>>>> (Although vfio_user_pci_process_req() could be called directly from the
>>>>>>> receiver coroutine, it seems safer to have a separate coroutine that
>>>>>>> processes requests so that the receiver isn't blocked in case
>>>>>>> vfio_user_pci_process_req() yields while processing a request.)
>>>>>>> 
>>>>>>> Going back to what you mentioned above, the receiver coroutine does
>>>>>>> something like this:
>>>>>>> 
>>>>>>> if it's a reply
>>>>>>>   reply = find_reply(...)
>>>>>>>   qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
>>>>>>> else
>>>>>>>   QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
>>>>>>>   if (pending_reqs_was_empty) {
>>>>>>>       qemu_coroutine_enter(process_request_co);
>>>>>>>   }
>>>>>>> 
>>>>>>> The pending_reqs queue holds incoming requests that the
>>>>>>> process_request_co coroutine processes.
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 	How do coroutines work across threads?  There can be multiple vCPU
>>>>>> threads waiting for replies, and I think the receiver coroutine will be
>>>>>> running in the main loop thread.  Where would a vCPU block waiting for
>>>>>> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller
>>>>> 
>>>>> 
>>>>> 
>>>>> A vCPU thread holding the BQL can iterate the event loop if it has
>>>>> reached a synchronous point that needs to wait for a reply before
>>>>> returning. I think we have this situation when a MemoryRegion is
>>>>> accessed on the proxy device.
>>>>> 
>>>>> For example, block/block-backend.c:blk_prw() kicks off a coroutine and
>>>>> then runs the event loop until the coroutine finishes:
>>>>> 
>>>>> Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
>>>>> bdrv_coroutine_enter(blk_bs(blk), co);
>>>>> BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
>>>>> 
>>>>> BDRV_POLL_WHILE() boils down to a loop like this:
>>>>> 
>>>>> while ((cond)) {
>>>>>   aio_poll(ctx, true);
>>>>> }
>>>>> 
>>>> 
>>>> 	I think that would make vCPUs sending requests and the
>>>> receiver coroutine all poll on the same socket.  If the “wrong”
>>>> routine reads the message, I’d need a second level of synchronization
>>>> to pass the message to the “right” one.  e.g., if the vCPU coroutine
>>>> reads a request, it needs to pass it to the receiver; if the receiver
>>>> coroutine reads a reply, it needs to pass it to a vCPU.
>>>> 
>>>> 	Avoiding this complexity is one of the reasons I went with
>>>> a separate thread that only reads the socket over the mp-qemu model,
>>>> which does have the sender poll, but doesn’t need to handle incoming
>>>> requests.
>>> 
>>> Only one coroutine reads from the socket, the "receiver" coroutine. In a
>>> previous reply I sketched what the receiver does:
>>> 
>>> if it's a reply
>>>     reply = find_reply(...)
>>>     qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
>>> else
>>>     QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
>>>     if (pending_reqs_was_empty) {
>>>         qemu_coroutine_enter(process_request_co);
>>>     }
>>> 
>> 
>> 	Sorry, I was assuming when you said the coroutine will block with
>> aio_poll(), you implied it would also read messages from the socket.
> 
> The vCPU thread calls aio_poll() outside the coroutine, similar to the
> block/block-backend.c:blk_prw() example I posted above:
> 
>  Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
>  bdrv_coroutine_enter(blk_bs(blk), co);
>  BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> 
> (BDRV_POLL_WHILE() is a aio_poll() loop.)
> 
> The coroutine isn't aware of aio_poll(), it just yields when it needs to
> wait.
> 
>>> The qemu_coroutine_enter(reply->co) call re-enters the coroutine that
>>> was created by the vCPU thread. Is this the "second level of
>>> synchronization" that you described? It's very similar to signalling
>>> reply->cv in the existing patch.
>>> 
>> 
>> 	Yes, the only difference is it would be woken on each message,
>> even though it doesn’t read them.  Which is what I think you’re addressing
>> below.
>> 
>>> Now I'm actually thinking about whether this can be improved by keeping
>>> the condvar so that the vCPU thread doesn't need to call aio_poll()
>>> (which is awkward because it doesn't drop the BQL and therefore blocks
>>> other vCPUs from making progress). That approach wouldn't require a
>>> dedicated thread for vfio-user.
>>> 
>> 
>> 	Wouldn’t you need to acquire BQL twice for every vCPU reply: once to
>> run the receiver coroutine, and once when the vCPU thread wakes up and wants
>> to return to the VFIO code.  The migration thread would also add a BQL
>> dependency, where it didn’t have one before.
> 
> If aio_poll() is used then the vCPU thread doesn't drop the BQL at all.
> The vCPU thread sends the message and waits for the reply while other
> BQL threads are locked out.
> 
> If a condvar or similar mechanism is used then the vCPU sends the
> message, drops the BQL, and waits on the condvar. The main loop thread
> runs the receiver coroutine and re-enters the coroutine, which signals
> the condvar. The vCPU then re-acquires the BQL.
> 

	I understand this.  The point I was trying to make was you'd need
to acquire BQL twice for every reply: once by the main loop before it runs
the receiver coroutine and once after the vCPU wakes up.  That would seem
to increase latency over the iothread model.


>> 	Is your objection with using an iothread, or using a separate thread?
>> I can change to using qemu_thread_create() and running aio_poll() from the
>> thread routine, instead of creating an iothread.
> 
> The vfio-user communication code shouldn't need to worry about threads
> or locks. The code can be written in terms of AioContext so the caller
> can use it from various environments without hardcoding details of the
> BQL or threads into the communication code. This makes it easier to
> understand and less tightly coupled.
> 
> I'll try to sketch how that could work:
> 
> The main concept is VFIOProxy, which has a QIOChannel (the socket
> connection) and its main API is:
> 
>  void coroutine_fn vfio_user_co_send_recv(VFIOProxy *proxy,
>          VFIOUserHdr *msg, VFIOUserFDs *fds, int rsize, int flags);
> 
> There is also a request callback for processing incoming requests:
> 
>  void coroutine_fn (*request)(void *opaque, char *buf,
>                              VFIOUserFDs *fds);
> 
> The main loop thread can either use vfio_user_co_send_recv() from
> coroutine context or use this blocking wrapper:
> 
>  typedef struct {
>      VFIOProxy *proxy;
>      VFIOUserHdr *msg;
>      VFIOUserFDs *fds;
>      int rsize;
>      int flags;
>      bool done;
>  } VFIOUserSendRecvData;
> 
>  static void coroutine_fn vfu_send_recv_co(void *opaque)
>  {
>      VFIOUserSendRecvData *data = opaque;
>      vfio_user_co_send_recv(data->proxy, data->msg, data->fds,
>                             data->rsize, data->flags);
>      data->done = true;
>  }
> 
>  /* A blocking version of vfio_user_co_send_recv() */
>  void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
>                           VFIOUserFDs *fds, int rsize, int flags)
>  {
>      VFIOUserSendRecvData data = {
>          .proxy = proxy,
> 	  .msg = msg,
> 	  .fds = fds,
> 	  .rsize = rsize,
> 	  .flags = flags,
>      };
>      Coroutine *co = qemu_coroutine_create(vfu_send_recv_co, &data);
>      qemu_coroutine_enter(co);
>      while (!data.done) {
>          aio_poll(proxy->ioc->ctx, true);
>      }
>  }
> 
> The vCPU thread can use vfio_user_send_recv() if it wants, although the
> BQL will be held, preventing other threads from making progress. That
> can be avoided by writing a similar wrapper that uses a QemuSemaphore:
> 
>  typedef struct {
>      VFIOProxy *proxy;
>      VFIOUserHdr *msg;
>      VFIOUserFDs *fds;
>      int rsize;
>      int flags;
>      QemuSemaphore sem;
>  } VFIOUserSendRecvData;
> 
>  static void coroutine_fn vfu_send_recv_co(void *opaque)
>  {
>      VFIOUserSendRecvData *data = opaque;
>      vfio_user_co_send_recv(data->proxy, data->msg, data->fds,
>                             data->rsize, data->flags);
>      qemu_sem_post(&data->sem);
>  }
> 
>  /*
>   * A blocking version of vfio_user_co_send_recv() that relies on
>   * another thread to run the event loop. This can be used from vCPU
>   * threads to avoid hogging the BQL.
>   */
>  void vfio_user_vcpu_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
>                                VFIOUserFDs *fds, int rsize, int flags)
>  {
>      VFIOUserSendRecvData data = {
>          .proxy = proxy,
> 	  .msg = msg,
> 	  .fds = fds,
> 	  .rsize = rsize,
> 	  .flags = flags,
>      };
>      Coroutine *co = qemu_coroutine_create(vfu_vcpu_send_recv_co, &data);
> 
>      qemu_sem_init(&data.sem, 0);
> 
>      qemu_coroutine_enter(co);
> 
>      qemu_mutex_unlock_iothread();
>      qemu_sem_wait(&data.sem);
>      qemu_mutex_lock_iothread();
> 
>      qemu_sem_destroy(&data.sem);
>  }
> 
> With vfio_user_vcpu_send_recv() the vCPU thread doesn't call aio_poll()
> itself but instead relies on the main loop thread to run the event loop.
> 

	I think this means I need 2 send algorithms: one for when called
from the main loop, and another for when called outside the main loop
(vCPU or migration).  I can’t use the semaphore version from the main
loop, since blocking the main loop would prevent the receiver routine
from being scheduled, so I’d want to use aio_poll() there.

	Some vfio_user calls can come from either place (e.g., realize
uses REGION_READ to read the device config space, and vCPU uses it
on a guest load to the device), so I’d need to detect which thread I’m
running in to choose the right sender.


> By writing coroutines that run in proxy->ioc->ctx we keep the threading
> model and locking in the caller. The communication code isn't aware of
> or tied to specific threads. It's possible to drop proxy->lock because
> state is only changed from within the AioContext, not multiple threads
> that may run in parallel. I think this makes the communication code
> simpler and cleaner.
> 
> It's possible to use IOThreads with this approach: set the QIOChannel's
> AioContext to the IOThread AioContext. However, I don't think we can do
> this in the vhost-user server yet because QEMU's device models expect to
> run with the BQL and not in an IOThread.
> 
> I didn't go into detail about how vfio_user_co_send_recv() is
> implemented. Please let me know if you want me to share ideas about
> that, but it's what we've already discussed with a "receiver" coroutine
> that re-enters the reply coroutines or calls ->request(). A CoMutex is
> needed to around qio_channel_write_all() to ensure that coroutines
> sending messages don't interleave partial writes if the socket sndbuf is
> exhausted.
> 

	Here is where I questioned how coroutines work across threads.
When the reply waiter is not the main loop, would the receiver coroutine
re-enter the reply coroutine or signal the condvar it is waiting on?


>> 	On a related subject:
>> 
>> On Aug 24, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> 
>>>> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
>>>> +                                 &local_err);
>>> 
>>> This is a blocking call. My understanding is that the IOThread is shared
>>> by all vfio-user devices, so other devices will have to wait if one of
>>> them is acting up (e.g. the device emulation process sent less than
>>> sizeof(msg) bytes).
>> 
>> 
>> 	This shouldn’t block if the emulation process sends less than sizeof(msg)
>> bytes.  qio_channel_readv() will eventually call recvmsg(), which only blocks a
>> short read if MSG_WAITALL is set, and it’s not set in this case.  recvmsg() will
>> return the data available, and vfio_user_recv() will treat a short read as an error.
> 
> That's true but vfio_user_recv() can still block layer on: if only
> sizeof(msg) bytes are available and msg.size > sizeof(msg) then the
> second call blocks.
> 
>  msgleft = msg.size - sizeof(msg);
>  if (msgleft != 0) {
>      ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
> 
> I think either code should be non-blocking or it shouldn't be. Writing
> code that is partially non-blocking is asking for trouble because it's
> not obvious where it can block and misbehaving or malicious programs can
> cause it to block.
> 

	I wonder if I should just go fully non-blocking, and have the
senders queue messages for the sending routine, and have the receiving
routine either signal a reply waiter or schedule a request handling
routine.

								JJ


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server
  2021-09-15 19:14                       ` John Johnson
@ 2021-09-16 11:49                         ` Stefan Hajnoczi
  0 siblings, 0 replies; 108+ messages in thread
From: Stefan Hajnoczi @ 2021-09-16 11:49 UTC (permalink / raw)
  To: John Johnson
  Cc: Elena Ufimtseva, Jag Raman, Swapnil Ingle, John Levon,
	QEMU Devel Mailing List, Alex Williamson, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 16225 bytes --]

On Wed, Sep 15, 2021 at 07:14:30PM +0000, John Johnson wrote:
> 
> 
> > On Sep 15, 2021, at 6:04 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Sep 15, 2021 at 12:21:10AM +0000, John Johnson wrote:
> >> 
> >> 
> >>> On Sep 14, 2021, at 6:06 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> 
> >>> On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
> >>>>>> On Sep 9, 2021, at 10:25 PM, John Johnson <john.g.johnson@oracle.com> wrote:
> >>>>>>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> >>>>>>>> 	I did look at coroutines, but they seemed to work when the sender
> >>>>>>>> is triggering the coroutine on send, not when request packets are arriving
> >>>>>>>> asynchronously to the sends.
> >>>>>>> 
> >>>>>>> This can be done with a receiver coroutine. Its job is to be the only
> >>>>>>> thing that reads vfio-user messages from the socket. A receiver
> >>>>>>> coroutine reads messages from the socket and wakes up the waiting
> >>>>>>> coroutine that yielded from vfio_user_send_recv() or
> >>>>>>> vfio_user_pci_process_req().
> >>>>>>> 
> >>>>>>> (Although vfio_user_pci_process_req() could be called directly from the
> >>>>>>> receiver coroutine, it seems safer to have a separate coroutine that
> >>>>>>> processes requests so that the receiver isn't blocked in case
> >>>>>>> vfio_user_pci_process_req() yields while processing a request.)
> >>>>>>> 
> >>>>>>> Going back to what you mentioned above, the receiver coroutine does
> >>>>>>> something like this:
> >>>>>>> 
> >>>>>>> if it's a reply
> >>>>>>>   reply = find_reply(...)
> >>>>>>>   qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >>>>>>> else
> >>>>>>>   QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >>>>>>>   if (pending_reqs_was_empty) {
> >>>>>>>       qemu_coroutine_enter(process_request_co);
> >>>>>>>   }
> >>>>>>> 
> >>>>>>> The pending_reqs queue holds incoming requests that the
> >>>>>>> process_request_co coroutine processes.
> >>>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 	How do coroutines work across threads?  There can be multiple vCPU
> >>>>>> threads waiting for replies, and I think the receiver coroutine will be
> >>>>>> running in the main loop thread.  Where would a vCPU block waiting for
> >>>>>> a reply?  I think coroutine_yield() returns to its coroutine_enter() caller
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> A vCPU thread holding the BQL can iterate the event loop if it has
> >>>>> reached a synchronous point that needs to wait for a reply before
> >>>>> returning. I think we have this situation when a MemoryRegion is
> >>>>> accessed on the proxy device.
> >>>>> 
> >>>>> For example, block/block-backend.c:blk_prw() kicks off a coroutine and
> >>>>> then runs the event loop until the coroutine finishes:
> >>>>> 
> >>>>> Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
> >>>>> bdrv_coroutine_enter(blk_bs(blk), co);
> >>>>> BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> >>>>> 
> >>>>> BDRV_POLL_WHILE() boils down to a loop like this:
> >>>>> 
> >>>>> while ((cond)) {
> >>>>>   aio_poll(ctx, true);
> >>>>> }
> >>>>> 
> >>>> 
> >>>> 	I think that would make vCPUs sending requests and the
> >>>> receiver coroutine all poll on the same socket.  If the “wrong”
> >>>> routine reads the message, I’d need a second level of synchronization
> >>>> to pass the message to the “right” one.  e.g., if the vCPU coroutine
> >>>> reads a request, it needs to pass it to the receiver; if the receiver
> >>>> coroutine reads a reply, it needs to pass it to a vCPU.
> >>>> 
> >>>> 	Avoiding this complexity is one of the reasons I went with
> >>>> a separate thread that only reads the socket over the mp-qemu model,
> >>>> which does have the sender poll, but doesn’t need to handle incoming
> >>>> requests.
> >>> 
> >>> Only one coroutine reads from the socket, the "receiver" coroutine. In a
> >>> previous reply I sketched what the receiver does:
> >>> 
> >>> if it's a reply
> >>>     reply = find_reply(...)
> >>>     qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >>> else
> >>>     QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >>>     if (pending_reqs_was_empty) {
> >>>         qemu_coroutine_enter(process_request_co);
> >>>     }
> >>> 
> >> 
> >> 	Sorry, I was assuming when you said the coroutine will block with
> >> aio_poll(), you implied it would also read messages from the socket.
> > 
> > The vCPU thread calls aio_poll() outside the coroutine, similar to the
> > block/block-backend.c:blk_prw() example I posted above:
> > 
> >  Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
> >  bdrv_coroutine_enter(blk_bs(blk), co);
> >  BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> > 
> > (BDRV_POLL_WHILE() is a aio_poll() loop.)
> > 
> > The coroutine isn't aware of aio_poll(), it just yields when it needs to
> > wait.
> > 
> >>> The qemu_coroutine_enter(reply->co) call re-enters the coroutine that
> >>> was created by the vCPU thread. Is this the "second level of
> >>> synchronization" that you described? It's very similar to signalling
> >>> reply->cv in the existing patch.
> >>> 
> >> 
> >> 	Yes, the only difference is it would be woken on each message,
> >> even though it doesn’t read them.  Which is what I think you’re addressing
> >> below.
> >> 
> >>> Now I'm actually thinking about whether this can be improved by keeping
> >>> the condvar so that the vCPU thread doesn't need to call aio_poll()
> >>> (which is awkward because it doesn't drop the BQL and therefore blocks
> >>> other vCPUs from making progress). That approach wouldn't require a
> >>> dedicated thread for vfio-user.
> >>> 
> >> 
> >> 	Wouldn’t you need to acquire BQL twice for every vCPU reply: once to
> >> run the receiver coroutine, and once when the vCPU thread wakes up and wants
> >> to return to the VFIO code.  The migration thread would also add a BQL
> >> dependency, where it didn’t have one before.
> > 
> > If aio_poll() is used then the vCPU thread doesn't drop the BQL at all.
> > The vCPU thread sends the message and waits for the reply while other
> > BQL threads are locked out.
> > 
> > If a condvar or similar mechanism is used then the vCPU sends the
> > message, drops the BQL, and waits on the condvar. The main loop thread
> > runs the receiver coroutine and re-enters the coroutine, which signals
> > the condvar. The vCPU then re-acquires the BQL.
> > 
> 
> 	I understand this.  The point I was trying to make was you'd need
> to acquire BQL twice for every reply: once by the main loop before it runs
> the receiver coroutine and once after the vCPU wakes up.  That would seem
> to increase latency over the iothread model.

Yes, but if minimizing latency is critical then you can use the
aio_poll() approach. It's fastest since it doesn't context switch or
drop the BQL.

Regarding vfio-user performance in general, devices should use ioeventfd
and/or mmap regions to avoid going through the VMM in
performance-critical code paths.

> >> 	Is your objection with using an iothread, or using a separate thread?
> >> I can change to using qemu_thread_create() and running aio_poll() from the
> >> thread routine, instead of creating an iothread.
> > 
> > The vfio-user communication code shouldn't need to worry about threads
> > or locks. The code can be written in terms of AioContext so the caller
> > can use it from various environments without hardcoding details of the
> > BQL or threads into the communication code. This makes it easier to
> > understand and less tightly coupled.
> > 
> > I'll try to sketch how that could work:
> > 
> > The main concept is VFIOProxy, which has a QIOChannel (the socket
> > connection) and its main API is:
> > 
> >  void coroutine_fn vfio_user_co_send_recv(VFIOProxy *proxy,
> >          VFIOUserHdr *msg, VFIOUserFDs *fds, int rsize, int flags);
> > 
> > There is also a request callback for processing incoming requests:
> > 
> >  void coroutine_fn (*request)(void *opaque, char *buf,
> >                              VFIOUserFDs *fds);
> > 
> > The main loop thread can either use vfio_user_co_send_recv() from
> > coroutine context or use this blocking wrapper:
> > 
> >  typedef struct {
> >      VFIOProxy *proxy;
> >      VFIOUserHdr *msg;
> >      VFIOUserFDs *fds;
> >      int rsize;
> >      int flags;
> >      bool done;
> >  } VFIOUserSendRecvData;
> > 
> >  static void coroutine_fn vfu_send_recv_co(void *opaque)
> >  {
> >      VFIOUserSendRecvData *data = opaque;
> >      vfio_user_co_send_recv(data->proxy, data->msg, data->fds,
> >                             data->rsize, data->flags);
> >      data->done = true;
> >  }
> > 
> >  /* A blocking version of vfio_user_co_send_recv() */
> >  void vfio_user_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
> >                           VFIOUserFDs *fds, int rsize, int flags)
> >  {
> >      VFIOUserSendRecvData data = {
> >          .proxy = proxy,
> > 	  .msg = msg,
> > 	  .fds = fds,
> > 	  .rsize = rsize,
> > 	  .flags = flags,
> >      };
> >      Coroutine *co = qemu_coroutine_create(vfu_send_recv_co, &data);
> >      qemu_coroutine_enter(co);
> >      while (!data.done) {
> >          aio_poll(proxy->ioc->ctx, true);
> >      }
> >  }
> > 
> > The vCPU thread can use vfio_user_send_recv() if it wants, although the
> > BQL will be held, preventing other threads from making progress. That
> > can be avoided by writing a similar wrapper that uses a QemuSemaphore:
> > 
> >  typedef struct {
> >      VFIOProxy *proxy;
> >      VFIOUserHdr *msg;
> >      VFIOUserFDs *fds;
> >      int rsize;
> >      int flags;
> >      QemuSemaphore sem;
> >  } VFIOUserSendRecvData;
> > 
> >  static void coroutine_fn vfu_send_recv_co(void *opaque)
> >  {
> >      VFIOUserSendRecvData *data = opaque;
> >      vfio_user_co_send_recv(data->proxy, data->msg, data->fds,
> >                             data->rsize, data->flags);
> >      qemu_sem_post(&data->sem);
> >  }
> > 
> >  /*
> >   * A blocking version of vfio_user_co_send_recv() that relies on
> >   * another thread to run the event loop. This can be used from vCPU
> >   * threads to avoid hogging the BQL.
> >   */
> >  void vfio_user_vcpu_send_recv(VFIOProxy *proxy, VFIOUserHdr *msg,
> >                                VFIOUserFDs *fds, int rsize, int flags)
> >  {
> >      VFIOUserSendRecvData data = {
> >          .proxy = proxy,
> > 	  .msg = msg,
> > 	  .fds = fds,
> > 	  .rsize = rsize,
> > 	  .flags = flags,
> >      };
> >      Coroutine *co = qemu_coroutine_create(vfu_vcpu_send_recv_co, &data);
> > 
> >      qemu_sem_init(&data.sem, 0);
> > 
> >      qemu_coroutine_enter(co);
> > 
> >      qemu_mutex_unlock_iothread();
> >      qemu_sem_wait(&data.sem);
> >      qemu_mutex_lock_iothread();
> > 
> >      qemu_sem_destroy(&data.sem);
> >  }
> > 
> > With vfio_user_vcpu_send_recv() the vCPU thread doesn't call aio_poll()
> > itself but instead relies on the main loop thread to run the event loop.
> > 
> 
> 	I think this means I need 2 send algorithms: one for when called
> from the main loop, and another for when called outside the main loop
> (vCPU or migration).  I can’t use the semaphore version from the main
> loop, since blocking the main loop would prevent the receiver routine
> from being scheduled, so I’d want to use aio_poll() there.
>
> 	Some vfio_user calls can come from either place (e.g., realize
> uses REGION_READ to read the device config space, and vCPU uses it
> on a guest load to the device), so I’d need to detect which thread I’m
> running in to choose the right sender.

The semaphore version is not really necessary, although it allows other
BQL threads to make progress while we wait for a reply (but is slower,
as you pointed out).

The aio_poll() approach can be used from either thread.

> > By writing coroutines that run in proxy->ioc->ctx we keep the threading
> > model and locking in the caller. The communication code isn't aware of
> > or tied to specific threads. It's possible to drop proxy->lock because
> > state is only changed from within the AioContext, not multiple threads
> > that may run in parallel. I think this makes the communication code
> > simpler and cleaner.
> > 
> > It's possible to use IOThreads with this approach: set the QIOChannel's
> > AioContext to the IOThread AioContext. However, I don't think we can do
> > this in the vhost-user server yet because QEMU's device models expect to
> > run with the BQL and not in an IOThread.
> > 
> > I didn't go into detail about how vfio_user_co_send_recv() is
> > implemented. Please let me know if you want me to share ideas about
> > that, but it's what we've already discussed with a "receiver" coroutine
> > that re-enters the reply coroutines or calls ->request(). A CoMutex is
> > needed to around qio_channel_write_all() to ensure that coroutines
> > sending messages don't interleave partial writes if the socket sndbuf is
> > exhausted.
> > 
> 
> 	Here is where I questioned how coroutines work across threads.
> When the reply waiter is not the main loop, would the receiver coroutine
> re-enter the reply coroutine or signal the condvar it is waiting on?

The global AioContext (qemu_get_aio_context()) is shared by multiple
thread and mutual exclusion is provided by the BQL. Any of the threads
can run the coroutines and during a coroutine's lifetime it can run in
multiple threads (but never simultaneously in multiple threads).

The vfu_send_recv_co() coroutine above starts in the vCPU thread but
then yields and is re-entered from the main loop thread. The receiver
coroutine re-enters vfu_send_recv_co() in the main loop thread where it
calls qemu_sem_post() to wake up the vCPU thread.

(IOThread AioContexts are not shared by multiple threads, so in that
case we don't need to worry about threads since everything is done in
the IOThread.)

> >> 	On a related subject:
> >> 
> >> On Aug 24, 2021, at 8:14 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> 
> >>>> +    ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
> >>>> +                                 &local_err);
> >>> 
> >>> This is a blocking call. My understanding is that the IOThread is shared
> >>> by all vfio-user devices, so other devices will have to wait if one of
> >>> them is acting up (e.g. the device emulation process sent less than
> >>> sizeof(msg) bytes).
> >> 
> >> 
> >> 	This shouldn’t block if the emulation process sends less than sizeof(msg)
> >> bytes.  qio_channel_readv() will eventually call recvmsg(), which only blocks a
> >> short read if MSG_WAITALL is set, and it’s not set in this case.  recvmsg() will
> >> return the data available, and vfio_user_recv() will treat a short read as an error.
> > 
> > That's true but vfio_user_recv() can still block layer on: if only
> > sizeof(msg) bytes are available and msg.size > sizeof(msg) then the
> > second call blocks.
> > 
> >  msgleft = msg.size - sizeof(msg);
> >  if (msgleft != 0) {
> >      ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
> > 
> > I think either code should be non-blocking or it shouldn't be. Writing
> > code that is partially non-blocking is asking for trouble because it's
> > not obvious where it can block and misbehaving or malicious programs can
> > cause it to block.
> > 
> 
> 	I wonder if I should just go fully non-blocking, and have the
> senders queue messages for the sending routine, and have the receiving
> routine either signal a reply waiter or schedule a request handling
> routine.

That sounds good.

If messages are sent from coroutine context rather than plain functions
then a separate sender isn't needed, they can use a CoMutex for mutual
exclusion/queuing instead of an explicit send queue and sender routine.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2021-09-16 11:50 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16 16:42 [PATCH RFC v2 00/16] vfio-user implementation Elena Ufimtseva
2021-08-16 16:42 ` [PATCH RFC v2 01/16] vfio-user: introduce vfio-user protocol specification Elena Ufimtseva
2021-08-17 23:04   ` Alex Williamson
2021-08-19  9:28     ` Swapnil Ingle
2021-08-19 15:32     ` John Johnson
2021-08-19 16:26       ` Alex Williamson
2021-08-16 16:42 ` [PATCH RFC v2 02/16] vfio-user: add VFIO base abstract class Elena Ufimtseva
2021-08-16 16:42 ` [PATCH RFC v2 03/16] vfio-user: Define type vfio_user_pci_dev_info Elena Ufimtseva
2021-08-24 13:52   ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 04/16] vfio-user: connect vfio proxy to remote server Elena Ufimtseva
2021-08-18 18:47   ` Alex Williamson
2021-08-19 14:10     ` John Johnson
2021-08-24 14:15   ` Stefan Hajnoczi
2021-08-30  3:00     ` John Johnson
2021-09-07 13:21       ` Stefan Hajnoczi
2021-09-09  5:11         ` John Johnson
2021-09-09  6:29           ` Stefan Hajnoczi
2021-09-10  5:25             ` John Johnson
2021-09-13 12:35               ` Stefan Hajnoczi
2021-09-13 17:23               ` John Johnson
2021-09-14 13:06                 ` Stefan Hajnoczi
2021-09-15  0:21                   ` John Johnson
2021-09-15 13:04                     ` Stefan Hajnoczi
2021-09-15 19:14                       ` John Johnson
2021-09-16 11:49                         ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 05/16] vfio-user: define VFIO Proxy and communication functions Elena Ufimtseva
2021-08-24 15:14   ` Stefan Hajnoczi
2021-08-30  3:04     ` John Johnson
2021-09-07 13:35       ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 06/16] vfio-user: negotiate version with remote server Elena Ufimtseva
2021-08-24 15:59   ` Stefan Hajnoczi
2021-08-30  3:08     ` John Johnson
2021-09-07 13:52       ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 07/16] vfio-user: get device info Elena Ufimtseva
2021-08-24 16:04   ` Stefan Hajnoczi
2021-08-30  3:11     ` John Johnson
2021-09-07 13:54       ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 08/16] vfio-user: get region info Elena Ufimtseva
2021-09-07 14:31   ` Stefan Hajnoczi
2021-09-09  5:35     ` John Johnson
2021-09-09  5:59       ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 09/16] vfio-user: region read/write Elena Ufimtseva
2021-09-07 14:41   ` Stefan Hajnoczi
2021-09-07 17:24   ` John Levon
2021-09-09  6:00     ` John Johnson
2021-09-09 12:05       ` John Levon
2021-09-10  6:07         ` John Johnson
2021-09-10 12:16           ` John Levon
2021-08-16 16:42 ` [PATCH RFC v2 10/16] vfio-user: pci_user_realize PCI setup Elena Ufimtseva
2021-09-07 15:00   ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 11/16] vfio-user: get and set IRQs Elena Ufimtseva
2021-09-07 15:14   ` Stefan Hajnoczi
2021-09-09  5:50     ` John Johnson
2021-09-09 13:50       ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 12/16] vfio-user: proxy container connect/disconnect Elena Ufimtseva
2021-09-08  8:30   ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 13/16] vfio-user: dma map/unmap operations Elena Ufimtseva
2021-09-08  9:16   ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 14/16] vfio-user: dma read/write operations Elena Ufimtseva
2021-09-08  9:51   ` Stefan Hajnoczi
2021-09-08 11:03     ` John Levon
2021-08-16 16:42 ` [PATCH RFC v2 15/16] vfio-user: pci reset Elena Ufimtseva
2021-09-08  9:56   ` Stefan Hajnoczi
2021-08-16 16:42 ` [PATCH RFC v2 16/16] vfio-user: migration support Elena Ufimtseva
2021-09-08 10:04   ` Stefan Hajnoczi
2021-08-27 17:53 ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Jagannathan Raman
2021-08-27 17:53   ` [PATCH RFC server v2 01/11] vfio-user: build library Jagannathan Raman
2021-08-27 18:05     ` Jag Raman
2021-09-08 12:25     ` Stefan Hajnoczi
2021-09-10 15:21       ` Philippe Mathieu-Daudé
2021-09-13 12:15         ` Stefan Hajnoczi
2021-09-10 15:20     ` Philippe Mathieu-Daudé
2021-09-10 17:08       ` Jag Raman
2021-09-11 22:29       ` John Levon
2021-09-13 10:19         ` Philippe Mathieu-Daudé
2021-08-27 17:53   ` [PATCH RFC server v2 02/11] vfio-user: define vfio-user object Jagannathan Raman
2021-09-08 12:37     ` Stefan Hajnoczi
2021-09-10 14:04       ` Jag Raman
2021-08-27 17:53   ` [PATCH RFC server v2 03/11] vfio-user: instantiate vfio-user context Jagannathan Raman
2021-09-08 12:40     ` Stefan Hajnoczi
2021-09-10 14:58       ` Jag Raman
2021-08-27 17:53   ` [PATCH RFC server v2 04/11] vfio-user: find and init PCI device Jagannathan Raman
2021-09-08 12:43     ` Stefan Hajnoczi
2021-09-10 15:02       ` Jag Raman
2021-08-27 17:53   ` [PATCH RFC server v2 05/11] vfio-user: run vfio-user context Jagannathan Raman
2021-09-08 12:58     ` Stefan Hajnoczi
2021-09-08 13:37       ` John Levon
2021-09-08 15:02         ` Stefan Hajnoczi
2021-09-08 15:21           ` John Levon
2021-09-08 15:46             ` Stefan Hajnoczi
2021-08-27 17:53   ` [PATCH RFC server v2 06/11] vfio-user: handle PCI config space accesses Jagannathan Raman
2021-09-09  7:27     ` Stefan Hajnoczi
2021-09-10 16:22       ` Jag Raman
2021-09-13 12:13         ` Stefan Hajnoczi
2021-08-27 17:53   ` [PATCH RFC server v2 07/11] vfio-user: handle DMA mappings Jagannathan Raman
2021-09-09  7:29     ` Stefan Hajnoczi
2021-08-27 17:53   ` [PATCH RFC server v2 08/11] vfio-user: handle PCI BAR accesses Jagannathan Raman
2021-09-09  7:37     ` Stefan Hajnoczi
2021-09-10 16:36       ` Jag Raman
2021-08-27 17:53   ` [PATCH RFC server v2 09/11] vfio-user: handle device interrupts Jagannathan Raman
2021-09-09  7:40     ` Stefan Hajnoczi
2021-08-27 17:53   ` [PATCH RFC server v2 10/11] vfio-user: register handlers to facilitate migration Jagannathan Raman
2021-09-09  8:14     ` Stefan Hajnoczi
2021-08-27 17:53   ` [PATCH RFC server v2 11/11] vfio-user: acceptance test Jagannathan Raman
2021-09-08 10:08   ` [PATCH RFC server v2 00/11] vfio-user server in QEMU Stefan Hajnoczi
2021-09-08 12:06     ` Jag Raman
2021-09-09  8:17   ` Stefan Hajnoczi
2021-09-10 14:02     ` Jag Raman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.