From: Stefan Hajnoczi <stefanha@redhat.com>
To: John Levon <john.levon@nutanix.com>
Cc: "libvfio-user-devel@nongnu.org" <libvfio-user-devel@nongnu.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
Cornelia Huck <cohuck@redhat.com>,
"john.g.johnson@oracle.com" <john.g.johnson@oracle.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [DRAFT RFC] ioeventfd/ioregionfd support in vfio-user
Date: Tue, 26 Jan 2021 11:01:04 +0000 [thread overview]
Message-ID: <20210126110104.GA250425@stefanha-x1.localdomain> (raw)
In-Reply-To: <20210121160905.GA314820@sent>
[-- Attachment #1: Type: text/plain, Size: 19982 bytes --]
On Thu, Jan 21, 2021 at 04:09:09PM +0000, John Levon wrote:
>
> Hi, please take a look. For context, this addition is against the current
> https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst
> specification.
>
> kvm@ readers: Stefan suggested this may be of interest from a VFIO point of
> view, in case there is any potential cross-over in defining how to wire up these
> fds.
After talking to Alex Williamson about including these APIs in the
kernel VFIO ioctls it became clear to me that there is an architectural
difference between classic vfio-pci devices and vfio-user devices:
The current model with kernel VFIO is that the VMM sets up ioeventfds
using ioctl(VFIO_DEVICE_IOEVENTFD). The VMM has device-specific
knowledge that allows it to add ioeventfds for specific hardware
registers.
In vfio-user the VMM has no knowledge of the specific device. Instead
the device tells the VMM how to configure ioeventfds and ioregionfds.
There are a few reasons for this:
1. It is not practical to add knowledge of every vfio-user device
implementation into the VMM. A cleaner solution is for the device to
advertise its desired configuration to the VMM instead.
2. The device implementation may prefer a specific ioeventfd or
ioregionfd configuration to fit its multi-threading architecture. A
multi-threaded (multi-queue) device implementation may want
ioeventfds and ioregionfds to be configured so that the right thread
receives certain MMIO/PIO accesses.
3. The device implementation may want to specify ioregionfd's user_data
value. This is the application-specific value that is passed along
with an MMIO/PIO access.
I think mdev fits in somewhere between these two models. I can imagine
complex mdev devices wanting to control the ioeventfd and ioregionfd
configuration just like vfio-user. But in simple cases it doesn't
matter.
I suggest including John's ioeventfd/ioregionfd work into the kernel
VFIO interface so mdev devices can take advantage of it too.
> Note that this is a new message instead of adding a new region capability to
> VFIO_USER_DEVICE_GET_REGION_INFO: with a new capability, there's no way for the
> server to know if ioeventfd/ioregionfd is actually used/requested by the client
> (the client would just have to close those fds if it didn't want to use FDs). An
> explicit new call is much clearer for this.
>
> The ioregionfd feature itself is at proposal stage, so there's a good chance of
> further changes depending on that.
>
> I also have these pending issues so far:
>
> 1) I'm not familiar with CCW bus, so don't know if
> KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY flag makes sense or is supportable in
> vfio-user context
>
> 2) if ioregionfd subsumes all ioeventfd use cases, we can remove the distinction
> from this API, and the client caller can translate to ioregionfd or ioeventfd as
> needed
>
> 3) is it neccessary for the client to indicate support (e.g. lacking ioregionfd
> support) ?
>
> 4) (ioregionfd issue) region_id is 4 bytes, which seems a little awkward from
> the server side: this has to encode both the region ID as well as the offset of
> the sub-region within that region. Can this be 64 bits wide?
The user_data field in Elena's ioregionfd patches is 64 bits. Does this
solve the problem?
>
> thanks
> john
>
> (NB: I edited the diff so the new sections are more readable.)
>
> diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst
> index e3adc7a..e7274a2 100644
> --- a/docs/vfio-user.rst
> +++ b/docs/vfio-user.rst
> @@ -161,6 +161,17 @@ in the region info reply of a device-specific region. Such regions are reflected
> in ``struct vfio_device_info.num_regions``. Thus, for PCI devices this value can
> be equal to, or higher than, VFIO_PCI_NUM_REGIONS.
>
> +Region I/O via file descriptors
> +-------------------------------
> +
> +For unmapped regions, region I/O from the client is done via
> +VFIO_USER_REGION_READ/WRITE. As an optimization, ioeventfds or ioregionfds may
> +be configured for sub-regions of some regions. A client may request information
> +on these sub-regions via VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring the
> +returned file descriptors as ioeventfds or ioregionfds, the server can be
> +directly notified of I/O (for example, by KVM) without taking a trip through the
> +client.
> +
> Interrupts
> ^^^^^^^^^^
> The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
> @@ -293,37 +304,39 @@ Commands
> The following table lists the VFIO message command IDs, and whether the
> message command is sent from the client or the server.
>
> -+----------------------------------+---------+-------------------+
> -| Name | Command | Request Direction |
> -+==================================+=========+===================+
> -| VFIO_USER_VERSION | 1 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_MAP | 2 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_UNMAP | 3 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_INFO | 4 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_REGION_INFO | 5 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_IRQ_INFO | 6 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_SET_IRQS | 7 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_REGION_READ | 8 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_REGION_WRITE | 9 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_READ | 10 | server -> client |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_WRITE | 11 | server -> client |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_VM_INTERRUPT | 12 | server -> client |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_RESET | 13 | client -> server |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DIRTY_PAGES | 14 | client -> server |
> -+----------------------------------+---------+-------------------+
> ++------------------------------------+---------+-------------------+
> +| Name | Command | Request Direction |
> ++====================================+=========+===================+
> +| VFIO_USER_VERSION | 1 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_MAP | 2 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_UNMAP | 3 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_INFO | 4 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_INFO | 5 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_IO_FDS | 6 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_IRQ_INFO | 7 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_SET_IRQS | 8 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_READ | 9 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_WRITE | 10 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ | 11 | server -> client |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_WRITE | 12 | server -> client |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_VM_INTERRUPT | 13 | server -> client |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_RESET | 14 | client -> server |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DIRTY_PAGES | 15 | client -> server |
> ++------------------------------------+---------+-------------------+
>
>
> .. Note:: Some VFIO defines cannot be reused since their values are
> @@ -1130,6 +1143,161 @@ client must write data on the same order and transction size as read.
> If an error occurs then the server must fail the read or write operation. It is
> an implementation detail of the client how to handle errors.
>
> VFIO_USER_DEVICE_GET_REGION_IO_FDS
> ----------------------------------
>
> Message format
> ^^^^^^^^^^^^^^
>
> +--------------+------------------------+
> | Name | Value |
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> | Command | 6 |
> +--------------+------------------------+
> | Message size | 32 + subregion info |
> +--------------+------------------------+
> | Flags | Reply bit set in reply |
> +--------------+------------------------+
> | Error | 0/errno |
> +--------------+------------------------+
> | Region info | Region IO FD info |
> +--------------+------------------------+
>
> Clients can access regions via VFIO_USER_REGION_READ/WRITE or, if provided, by
> mmap()ing a file descriptor provided by the server.
>
> VFIO_USER_DEVICE_GET_REGION_IO_FDS provides an alternative access mechanism via
> file descriptors. This is an optional feature intended for performance
> improvements where an underlying sub-system (such as KVM) supports communication
> across such file descriptors to the vfio-user server, without needing to
> round-trip through the client.
>
> The server returns an array describing sub-regions of the given region along
> with the server specifies a set of sub-regions and the requested file descriptor
> notification mechanism to use for that sub-region. Each sub-region in the
> response message may choose to use a different method, as defined below. The
> two mechanisms supported in this specification are ioeventfds and ioregionfds.
>
> A client should then hook up the returned file descriptors with the notification
> method requested.
>
> Region IO FD info format
> ^^^^^^^^^^^^^^^^^^^^^^^^
>
> +------------+--------+------+
> | Name | Offset | Size |
> +============+========+======+
> | argsz | 16 | 4 |
> +------------+--------+------+
> | flags | 20 | 4 |
> +------------+--------+------+
> | index | 24 | 4 |
> +------------+--------+------+
> | count | 28 | 4 |
> +------------+--------+------+
> | subregions | 32 | ... |
> +------------+--------+------+
>
> * *argsz* is the size of the region IO FD info structure plus the
> total size of the subregion array. Thus, each array entry "i" is at offset
> i * ((argsz - 32) / count)
> * *flags* must be zero
> * *index* is the index of memory region being queried
> * *count* is the number of sub-regions in the array
> * *subregions* is the array of Sub-Region IO FD info structures
>
> The client must set ``flags`` to zero and specify the region being queried in
> the ``index``.
>
> The client sets the ``argsz`` field to indicate the maximum size of the response
> that the server can send, which must be at least the size of the response header
> plus space for the subregion array. If the full response size exceeds ``argsz``,
> then the server must respond only with the response header and the Region IO FD
> info structure, setting in ``argsz`` the buffer size required to store the full
> response. In this case, no file descriptors are passed back. The client then
> retries the operation with a larger receive buffer.
>
> The reply message will additionally include at least one file descriptor in the
> ancillary data. Note that more than one subregion may share the same file
> descriptor.
>
> Each sub-region given in the response has one of two possible structures,
> depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or
> `VFIO_USER_IO_FD_TYPE_IOREGIONFD`:
>
> Sub-Region IO FD info format (ioeventfd)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> +-----------+--------+------+
> | Name | Offset | Size |
> +===========+========+======+
> | offset | 0 | 8 |
> +-----------+--------+------+
> | size | 8 | 8 |
> +-----------+--------+------+
> | fd_index | 16 | 4 |
> +-----------+--------+------+
> | type | 20 | 4 |
> +-----------+--------+------+
> | flags | 24 | 4 |
> +-----------+--------+------+
> | padding | 28 | 4 |
> +-----------+--------+------+
> | datamatch | 32 | 8 |
> +-----------+--------+------+
>
> * *offset* is the offset of the start of the sub-region within the region
> requested ("physical address offset" for the region)
> * *size* is the length of the sub-region. This may be zero if the access size is
> not relevant, which may allow for optimizations
> * *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
> notification; it may be shared.
> * *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD`
> * *flags* is any of:
> * `KVM_IOEVENTFD_FLAG_DATAMATCH`
> * `KVM_IOEVENTFD_FLAG_PIO`
> * `KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY` (FIXME: makes sense?)
> * *datamatch* is the datamatch value if needed
>
> See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt 4.59
> KVM_IOEVENTFD for further context on the ioeventfd-specific fields.
>
> Sub-Region IO FD info format (ioregionfd)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> +-----------+--------+------+
> | Name | Offset | Size |
> +===========+========+======+
> | offset | 0 | 8 |
> +-----------+--------+------+
> | size | 8 | 8 |
> +-----------+--------+------+
> | fd_index | 16 | 4 |
> +-----------+--------+------+
> | type | 20 | 4 |
> +-----------+--------+------+
> | flags | 24 | 4 |
> +-----------+--------+------+
> | region_id | 28 | 4 |
> +-----------+--------+------+
>
> * *offset* is the offset of the start of the sub-region within the region
> requested ("physical address offset" for the region)
> * *size* is the length of the sub-region. FIXME: may allow zero?
> * *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
> messages; it may be shared
> * *type* is `VFIO_USER_IO_FD_TYPE_IOREGIONFD`
> * *flags* is any of:
> * `KVM_IOREGIONFD_FLAG_PIO`
> * `KVM_IOREGIONFD_FLAG_POSTED_WRITES`
> * *region_id* is an opaque value passed back to the server via a message on the
> file descriptor
>
> See https://www.spinics.net/lists/kvm/msg208139.html (FIXME) for further context
> on the ioregionfd-specific fields.
>
> VFIO_USER_DEVICE_GET_IRQ_INFO
> -----------------------------
>
> @@ -1141,7 +1309,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 6 |
> +| Command | 7 |
> +--------------+------------------------+
> | Message size | 32 |
> +--------------+------------------------+
> @@ -1212,7 +1380,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 7 |
> +| Command | 8 |
> +--------------+------------------------+
> | Message size | 36 + any data |
> +--------------+------------------------+
> @@ -1370,7 +1538,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 8 |
> +| Command | 9 |
> +--------------+------------------------+
> | Message size | 32 + data size |
> +--------------+------------------------+
> @@ -1397,7 +1565,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 9 |
> +| Command | 10 |
> +--------------+------------------------+
> | Message size | 32 + data size |
> +--------------+------------------------+
> @@ -1424,7 +1592,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 10 |
> +| Command | 11 |
> +--------------+------------------------+
> | Message size | 28 + data size |
> +--------------+------------------------+
> @@ -1451,7 +1619,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 11 |
> +| Command | 12 |
> +--------------+------------------------+
> | Message size | 28 + data size |
> +--------------+------------------------+
> @@ -1478,7 +1646,7 @@ Message format
> +================+========================+
> | Message ID | <ID> |
> +----------------+------------------------+
> -| Command | 12 |
> +| Command | 13 |
> +----------------+------------------------+
> | Message size | 20 |
> +----------------+------------------------+
> @@ -1515,7 +1683,7 @@ Message format
> +==============+========================+
> | Message ID | <ID> |
> +--------------+------------------------+
> -| Command | 13 |
> +| Command | 14 |
> +--------------+------------------------+
> | Message size | 16 |
> +--------------+------------------------+
> @@ -1537,7 +1705,7 @@ Message format
> +====================+========================+
> | Message ID | <ID> |
> +--------------------+------------------------+
> -| Command | 14 |
> +| Command | 15 |
> +--------------------+------------------------+
> | Message size | 16 |
> +--------------------+------------------------+
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
next prev parent reply other threads:[~2021-01-26 12:22 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-01-21 16:09 [DRAFT RFC] ioeventfd/ioregionfd support in vfio-user John Levon
2021-01-26 11:01 ` Stefan Hajnoczi [this message]
2021-01-28 14:38 ` John Levon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210126110104.GA250425@stefanha-x1.localdomain \
--to=stefanha@redhat.com \
--cc=cohuck@redhat.com \
--cc=john.g.johnson@oracle.com \
--cc=john.levon@nutanix.com \
--cc=kvm@vger.kernel.org \
--cc=libvfio-user-devel@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).