KVM Archive on lore.kernel.org
 help / color / Atom feed
From: Stefan Hajnoczi <stefanha@redhat.com>
To: John Levon <john.levon@nutanix.com>
Cc: "libvfio-user-devel@nongnu.org" <libvfio-user-devel@nongnu.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Cornelia Huck <cohuck@redhat.com>,
	"john.g.johnson@oracle.com" <john.g.johnson@oracle.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [DRAFT RFC] ioeventfd/ioregionfd support in vfio-user
Date: Tue, 26 Jan 2021 11:01:04 +0000
Message-ID: <20210126110104.GA250425@stefanha-x1.localdomain> (raw)
In-Reply-To: <20210121160905.GA314820@sent>


[-- Attachment #1: Type: text/plain, Size: 19982 bytes --]

On Thu, Jan 21, 2021 at 04:09:09PM +0000, John Levon wrote:
> 
> Hi, please take a look. For context, this addition is against the current
> https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst
> specification.
> 
> kvm@ readers: Stefan suggested this may be of interest from a VFIO point of
> view, in case there is any potential cross-over in defining how to wire up these
> fds.

After talking to Alex Williamson about including these APIs in the
kernel VFIO ioctls it became clear to me that there is an architectural
difference between classic vfio-pci devices and vfio-user devices:

The current model with kernel VFIO is that the VMM sets up ioeventfds
using ioctl(VFIO_DEVICE_IOEVENTFD). The VMM has device-specific
knowledge that allows it to add ioeventfds for specific hardware
registers.

In vfio-user the VMM has no knowledge of the specific device. Instead
the device tells the VMM how to configure ioeventfds and ioregionfds.
There are a few reasons for this:

1. It is not practical to add knowledge of every vfio-user device
   implementation into the VMM. A cleaner solution is for the device to
   advertise its desired configuration to the VMM instead.

2. The device implementation may prefer a specific ioeventfd or
   ioregionfd configuration to fit its multi-threading architecture. A
   multi-threaded (multi-queue) device implementation may want
   ioeventfds and ioregionfds to be configured so that the right thread
   receives certain MMIO/PIO accesses.

3. The device implementation may want to specify ioregionfd's user_data
   value. This is the application-specific value that is passed along
   with an MMIO/PIO access.

I think mdev fits in somewhere between these two models. I can imagine
complex mdev devices wanting to control the ioeventfd and ioregionfd
configuration just like vfio-user. But in simple cases it doesn't
matter.

I suggest including John's ioeventfd/ioregionfd work into the kernel
VFIO interface so mdev devices can take advantage of it too.

> Note that this is a new message instead of adding a new region capability to
> VFIO_USER_DEVICE_GET_REGION_INFO: with a new capability, there's no way for the
> server to know if ioeventfd/ioregionfd is actually used/requested by the client
> (the client would just have to close those fds if it didn't want to use FDs). An
> explicit new call is much clearer for this.
> 
> The ioregionfd feature itself is at proposal stage, so there's a good chance of
> further changes depending on that.
> 
> I also have these pending issues so far:
> 
> 1) I'm not familiar with CCW bus, so don't know if
> KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY flag makes sense or is supportable in
> vfio-user context
> 
> 2) if ioregionfd subsumes all ioeventfd use cases, we can remove the distinction
> from this API, and the client caller can translate to ioregionfd or ioeventfd as
> needed
> 
> 3) is it neccessary for the client to indicate support (e.g. lacking ioregionfd
> support) ?
> 
> 4) (ioregionfd issue) region_id is 4 bytes, which seems a little awkward from
> the server side: this has to encode both the region ID as well as the offset of
> the sub-region within that region. Can this be 64 bits wide?

The user_data field in Elena's ioregionfd patches is 64 bits. Does this
solve the problem?

> 
> thanks
> john
> 
> (NB: I edited the diff so the new sections are more readable.)
> 
> diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst
> index e3adc7a..e7274a2 100644
> --- a/docs/vfio-user.rst
> +++ b/docs/vfio-user.rst
> @@ -161,6 +161,17 @@ in the region info reply of a device-specific region. Such regions are reflected
>  in ``struct vfio_device_info.num_regions``. Thus, for PCI devices this value can
>  be equal to, or higher than, VFIO_PCI_NUM_REGIONS.
>  
> +Region I/O via file descriptors
> +-------------------------------
> +
> +For unmapped regions, region I/O from the client is done via
> +VFIO_USER_REGION_READ/WRITE.  As an optimization, ioeventfds or ioregionfds may
> +be configured for sub-regions of some regions. A client may request information
> +on these sub-regions via VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring the
> +returned file descriptors as ioeventfds or ioregionfds, the server can be
> +directly notified of I/O (for example, by KVM) without taking a trip through the
> +client.
> +
>  Interrupts
>  ^^^^^^^^^^
>  The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
> @@ -293,37 +304,39 @@ Commands
>  The following table lists the VFIO message command IDs, and whether the
>  message command is sent from the client or the server.
>  
> -+----------------------------------+---------+-------------------+
> -| Name                             | Command | Request Direction |
> -+==================================+=========+===================+
> -| VFIO_USER_VERSION                | 1       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_MAP                | 2       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_UNMAP              | 3       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_INFO        | 4       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_SET_IRQS        | 7       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_REGION_READ            | 8       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_REGION_WRITE           | 9       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_READ               | 10      | server -> client  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_WRITE              | 11      | server -> client  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_VM_INTERRUPT           | 12      | server -> client  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_RESET           | 13      | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DIRTY_PAGES            | 14      | client -> server  |
> -+----------------------------------+---------+-------------------+
> ++------------------------------------+---------+-------------------+
> +| Name                               | Command | Request Direction |
> ++====================================+=========+===================+
> +| VFIO_USER_VERSION                  | 1       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_MAP                  | 2       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_UNMAP                | 3       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_INFO          | 4       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_INFO   | 5       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_IO_FDS | 6       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_IRQ_INFO      | 7       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_SET_IRQS          | 8       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_READ              | 9       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_WRITE             | 10      | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ                 | 11      | server -> client  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_WRITE                | 12      | server -> client  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_VM_INTERRUPT             | 13      | server -> client  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_RESET             | 14      | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DIRTY_PAGES              | 15      | client -> server  |
> ++------------------------------------+---------+-------------------+
>  
>  
>  .. Note:: Some VFIO defines cannot be reused since their values are
> @@ -1130,6 +1143,161 @@ client must write data on the same order and transction size as read.
>  If an error occurs then the server must fail the read or write operation. It is
>  an implementation detail of the client how to handle errors.
>  
> VFIO_USER_DEVICE_GET_REGION_IO_FDS
> ----------------------------------
> 
> Message format
> ^^^^^^^^^^^^^^
> 
> +--------------+------------------------+
> | Name         | Value                  |
> +==============+========================+
> | Message ID   | <ID>                   |
> +--------------+------------------------+
> | Command      | 6                      |
> +--------------+------------------------+
> | Message size | 32 + subregion info    |
> +--------------+------------------------+
> | Flags        | Reply bit set in reply |
> +--------------+------------------------+
> | Error        | 0/errno                |
> +--------------+------------------------+
> | Region info  | Region IO FD info      |
> +--------------+------------------------+
> 
> Clients can access regions via VFIO_USER_REGION_READ/WRITE or, if provided, by
> mmap()ing a file descriptor provided by the server.
> 
> VFIO_USER_DEVICE_GET_REGION_IO_FDS provides an alternative access mechanism via
> file descriptors. This is an optional feature intended for performance
> improvements where an underlying sub-system (such as KVM) supports communication
> across such file descriptors to the vfio-user server, without needing to
> round-trip through the client.
> 
> The server returns an array describing sub-regions of the given region along
> with the server specifies a set of sub-regions and the requested file descriptor
> notification mechanism to use for that sub-region.  Each sub-region in the
> response message may choose to use a different method, as defined below.  The
> two mechanisms supported in this specification are ioeventfds and ioregionfds.
> 
> A client should then hook up the returned file descriptors with the notification
> method requested.
> 
> Region IO FD info format
> ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> +------------+--------+------+
> | Name       | Offset | Size |
> +============+========+======+
> | argsz      | 16     | 4    |
> +------------+--------+------+
> | flags      | 20     | 4    |
> +------------+--------+------+
> | index      | 24     | 4    |
> +------------+--------+------+
> | count      | 28     | 4    |
> +------------+--------+------+
> | subregions | 32     | ...  |
> +------------+--------+------+
> 
> * *argsz* is the size of the region IO FD info structure plus the
>   total size of the subregion array. Thus, each array entry "i" is at offset
>     i * ((argsz - 32) / count)
> * *flags* must be zero
> * *index* is the index of memory region being queried
> * *count* is the number of sub-regions in the array
> * *subregions* is the array of Sub-Region IO FD info structures
> 
> The client must set ``flags`` to zero and specify the region being queried in
> the ``index``.
> 
> The client sets the ``argsz`` field to indicate the maximum size of the response
> that the server can send, which must be at least the size of the response header
> plus space for the subregion array. If the full response size exceeds ``argsz``,
> then the server must respond only with the response header and the Region IO FD
> info structure, setting in ``argsz`` the buffer size required to store the full
> response. In this case, no file descriptors are passed back.  The client then
> retries the operation with a larger receive buffer.
> 
> The reply message will additionally include at least one file descriptor in the
> ancillary data. Note that more than one subregion may share the same file
> descriptor.
> 
> Each sub-region given in the response has one of two possible structures,
> depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or
> `VFIO_USER_IO_FD_TYPE_IOREGIONFD`:
> 
> Sub-Region IO FD info format (ioeventfd)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> +-----------+--------+------+
> | Name      | Offset | Size |
> +===========+========+======+
> | offset    | 0      | 8    |
> +-----------+--------+------+
> | size      | 8      | 8    |
> +-----------+--------+------+
> | fd_index  | 16     | 4    |
> +-----------+--------+------+
> | type      | 20     | 4    |
> +-----------+--------+------+
> | flags     | 24     | 4    |
> +-----------+--------+------+
> | padding   | 28     | 4    |
> +-----------+--------+------+
> | datamatch | 32     | 8    |
> +-----------+--------+------+
> 
> * *offset* is the offset of the start of the sub-region within the region
>   requested ("physical address offset" for the region)
> * *size* is the length of the sub-region. This may be zero if the access size is
>   not relevant, which may allow for optimizations
> * *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
>   notification; it may be shared.
> * *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD`
> * *flags* is any of:
>   * `KVM_IOEVENTFD_FLAG_DATAMATCH`
>   * `KVM_IOEVENTFD_FLAG_PIO`
>   * `KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY` (FIXME: makes sense?)
> * *datamatch* is the datamatch value if needed
> 
> See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt 4.59
> KVM_IOEVENTFD for further context on the ioeventfd-specific fields.
> 
> Sub-Region IO FD info format (ioregionfd)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> +-----------+--------+------+
> | Name      | Offset | Size |
> +===========+========+======+
> | offset    | 0      | 8    |
> +-----------+--------+------+
> | size      | 8      | 8    |
> +-----------+--------+------+
> | fd_index  | 16     | 4    |
> +-----------+--------+------+
> | type      | 20     | 4    |
> +-----------+--------+------+
> | flags     | 24     | 4    |
> +-----------+--------+------+
> | region_id | 28     | 4    |
> +-----------+--------+------+
> 
> * *offset* is the offset of the start of the sub-region within the region
>   requested ("physical address offset" for the region)
> * *size* is the length of the sub-region. FIXME: may allow zero?
> * *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
>   messages; it may be shared
> * *type* is `VFIO_USER_IO_FD_TYPE_IOREGIONFD`
> * *flags* is any of:
>   * `KVM_IOREGIONFD_FLAG_PIO`
>   * `KVM_IOREGIONFD_FLAG_POSTED_WRITES`
> * *region_id* is an opaque value passed back to the server via a message on the
>   file descriptor
> 
> See https://www.spinics.net/lists/kvm/msg208139.html (FIXME) for further context
> on the ioregionfd-specific fields.
> 
>  VFIO_USER_DEVICE_GET_IRQ_INFO
>  -----------------------------
>  
> @@ -1141,7 +1309,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 6                      |
> +| Command      | 7                      |
>  +--------------+------------------------+
>  | Message size | 32                     |
>  +--------------+------------------------+
> @@ -1212,7 +1380,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 7                      |
> +| Command      | 8                      |
>  +--------------+------------------------+
>  | Message size | 36 + any data          |
>  +--------------+------------------------+
> @@ -1370,7 +1538,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 8                      |
> +| Command      | 9                      |
>  +--------------+------------------------+
>  | Message size | 32 + data size         |
>  +--------------+------------------------+
> @@ -1397,7 +1565,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 9                      |
> +| Command      | 10                     |
>  +--------------+------------------------+
>  | Message size | 32 + data size         |
>  +--------------+------------------------+
> @@ -1424,7 +1592,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 10                     |
> +| Command      | 11                     |
>  +--------------+------------------------+
>  | Message size | 28 + data size         |
>  +--------------+------------------------+
> @@ -1451,7 +1619,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 11                     |
> +| Command      | 12                     |
>  +--------------+------------------------+
>  | Message size | 28 + data size         |
>  +--------------+------------------------+
> @@ -1478,7 +1646,7 @@ Message format
>  +================+========================+
>  | Message ID     | <ID>                   |
>  +----------------+------------------------+
> -| Command        | 12                     |
> +| Command        | 13                     |
>  +----------------+------------------------+
>  | Message size   | 20                     |
>  +----------------+------------------------+
> @@ -1515,7 +1683,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 13                     |
> +| Command      | 14                     |
>  +--------------+------------------------+
>  | Message size | 16                     |
>  +--------------+------------------------+
> @@ -1537,7 +1705,7 @@ Message format
>  +====================+========================+
>  | Message ID         | <ID>                   |
>  +--------------------+------------------------+
> -| Command            | 14                     |
> +| Command            | 15                     |
>  +--------------------+------------------------+
>  | Message size       | 16                     |
>  +--------------------+------------------------+
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply index

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-21 16:09 John Levon
2021-01-26 11:01 ` Stefan Hajnoczi [this message]
2021-01-28 14:38   ` John Levon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210126110104.GA250425@stefanha-x1.localdomain \
    --to=stefanha@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=john.g.johnson@oracle.com \
    --cc=john.levon@nutanix.com \
    --cc=kvm@vger.kernel.org \
    --cc=libvfio-user-devel@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
		kvm@vger.kernel.org
	public-inbox-index kvm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.kvm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git