* MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
@ 2020-11-25 20:44 Elena Afanasova
From: Elena Afanasova @ 2020-11-25 20:44 UTC (permalink / raw)
  To: kvm
  Cc: mst, john.g.johnson, dinechin, cohuck, jasowang, felipe,
	stefanha, elena.ufimtseva, jag.raman, eafanasova

Hello,

I'm an Outreachy intern with QEMU and I'm working on implementing the
ioregionfd API in KVM, so I'd like to resume the ioregionfd design discussion.
The latest version of the ioregionfd API document is provided below.

Overview
--------
ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses over a
file descriptor without returning from ioctl(KVM_RUN). This allows device
emulation to run in another task separate from the vCPU task.

This is achieved through KVM ioctls for registering MMIO/PIO regions and a wire
protocol that KVM uses to communicate with a task handling an MMIO/PIO access.

The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation in a
separate task looks like this:

   kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task

ioregionfd improves performance by eliminating the need for the vCPU task to
forward MMIO/PIO exits to device emulation tasks:

   kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task
     ^
     `---ioregionfd---> device task

Both multi-threaded and multi-process VMMs can take advantage of ioregionfd to
run device emulation in dedicated threads and processes, respectively.

This mechanism is similar to ioeventfd except it supports all read and write
accesses, whereas ioeventfd only supports posted doorbell writes.

Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work alongside
the new mechanism, but only one mechanism handles any given MMIO/PIO access.

KVM_CREATE_IOREGIONFD
---------------------
:Capability: KVM_CAP_IOREGIONFD
:Architectures: all
:Type: system ioctl
:Parameters: none
:Returns: an ioregionfd file descriptor, -1 on error

This ioctl creates a new ioregionfd and returns the file descriptor. The fd can
be used to handle MMIO/PIO accesses instead of returning from ioctl(KVM_RUN)
with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must be
registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses on the
fd. An ioregionfd can be used with multiple VMs and its lifecycle is not tied
to a specific VM.

When the last file descriptor for an ioregionfd is closed, all regions
registered with KVM_SET_IOREGION are dropped and guest accesses to those
regions cause ioctl(KVM_RUN) to return again.

KVM_SET_IOREGION
----------------
:Capability: KVM_CAP_IOREGIONFD
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_ioregion (in)
:Returns: 0 on success, -1 on error

This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO region. Guest
read and write accesses are dispatched through the given ioregionfd instead of
returning from ioctl(KVM_RUN).

::

  struct kvm_ioregion {
      __u64 guest_paddr; /* guest physical address */
      __u64 memory_size; /* bytes */
      __u64 user_data;
      __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
      __u32 flags;
      __u8  pad[32];
  };

  /* for kvm_ioregion::flags */
  #define KVM_IOREGION_PIO           (1u << 0)
  #define KVM_IOREGION_POSTED_WRITES (1u << 1)

If a new region would split an existing region, -1 is returned and errno is
EINVAL.

Regions can be deleted by setting fd to -1. If no existing region matches
guest_paddr and memory_size then -1 is returned and errno is ENOENT.

Existing regions can be modified as long as guest_paddr and memory_size
match an existing region.

MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead.

The user_data value is included in messages KVM writes to the ioregionfd upon
guest access. KVM does not interpret user_data.

By default, both read and write guest accesses wait for a response before the
guest is entered again. With the KVM_IOREGION_POSTED_WRITES flag, guest writes
do not wait for a response and the guest is entered again immediately. This is
suitable for accesses that do not require synchronous emulation, such as posted
doorbell register writes. Note that guest writes may block the vCPU despite
KVM_IOREGION_POSTED_WRITES if the device is too slow in reading from the
ioregionfd.
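
As a rough illustration of the add/modify/delete rules above, the following
sketch fills in struct kvm_ioregion for each case. The helper names
ioregion_set() and ioregion_del() are made up for this example, zeroing pad[]
is an assumption, and passing flags on deletion (to pick MMIO or PIO) is also
an assumption that the text does not confirm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the proposed layout; no released kernel header defines it. */
struct kvm_ioregion {
    uint64_t guest_paddr; /* guest physical address */
    uint64_t memory_size; /* bytes */
    uint64_t user_data;   /* opaque, echoed back in each command */
    int32_t  fd;          /* from KVM_CREATE_IOREGIONFD, or -1 to delete */
    uint32_t flags;
    uint8_t  pad[32];
};

#define KVM_IOREGION_PIO           (1u << 0)
#define KVM_IOREGION_POSTED_WRITES (1u << 1)

/* The fields are naturally aligned, so no hidden padding is expected. */
static_assert(sizeof(struct kvm_ioregion) == 64, "unexpected struct padding");

/* Hypothetical helper: describe a region to add or modify. */
static inline struct kvm_ioregion ioregion_set(uint64_t gpa, uint64_t size,
                                               uint64_t user_data, int32_t fd,
                                               uint32_t flags)
{
    struct kvm_ioregion r;

    memset(&r, 0, sizeof(r)); /* zero pad[] (assumed to be expected) */
    r.guest_paddr = gpa;
    r.memory_size = size;
    r.user_data = user_data;
    r.fd = fd;
    r.flags = flags;
    return r;
}

/*
 * Hypothetical helper: describe the removal of an existing region.
 * guest_paddr and memory_size must match the registered region exactly.
 */
static inline struct kvm_ioregion ioregion_del(uint64_t gpa, uint64_t size,
                                               uint32_t flags)
{
    return ioregion_set(gpa, size, 0, -1, flags);
}
```

The resulting structure would be passed to ioctl() on the VM fd with the
KVM_SET_IOREGION request. The request number is not given in this document, so
the call itself is left out of the sketch.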

Wire protocol
-------------
The protocol spoken over the file descriptor is as follows. The device reads
commands from the file descriptor with the following layout::

  struct ioregionfd_cmd {
      __u32 info;
      __u32 padding;
      __u64 user_data;
      __u64 offset;
      __u64 data;
  };

The info field layout is as follows::

  bits:  | 31 ... 7 |  6   | 5 ... 4 | 3 ... 0 |
  field: | reserved | resp |   size  |   cmd   |

The cmd field identifies the operation to perform::

  #define IOREGIONFD_CMD_READ  0
  #define IOREGIONFD_CMD_WRITE 1

The size field indicates the size of the access::

  #define IOREGIONFD_SIZE_8BIT  0
  #define IOREGIONFD_SIZE_16BIT 1
  #define IOREGIONFD_SIZE_32BIT 2
  #define IOREGIONFD_SIZE_64BIT 3

If the command is IOREGIONFD_CMD_WRITE then the resp bit indicates whether or
not a response must be sent.
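
A device could unpack the info field along these lines. This is only a sketch:
the INFO_* macro names and the info_*() helpers are invented here, and the bit
positions are taken from the layout above (cmd in bits 3..0, size in bits 5..4,
resp in bit 6).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOREGIONFD_CMD_READ  0
#define IOREGIONFD_CMD_WRITE 1

#define IOREGIONFD_SIZE_8BIT  0
#define IOREGIONFD_SIZE_16BIT 1
#define IOREGIONFD_SIZE_32BIT 2
#define IOREGIONFD_SIZE_64BIT 3

/* Illustrative names; only the positions come from the document. */
#define INFO_CMD_MASK   0xfu /* bits 3..0 */
#define INFO_SIZE_SHIFT 4    /* bits 5..4 */
#define INFO_SIZE_MASK  0x3u
#define INFO_RESP_SHIFT 6    /* bit 6 */

static_assert(((INFO_SIZE_MASK << INFO_SIZE_SHIFT) & INFO_CMD_MASK) == 0,
              "size and cmd fields overlap");

/* Build an info word, as the sending side would. */
static inline uint32_t info_pack(unsigned cmd, unsigned size, bool resp)
{
    return (cmd & INFO_CMD_MASK) |
           ((size & INFO_SIZE_MASK) << INFO_SIZE_SHIFT) |
           ((uint32_t)resp << INFO_RESP_SHIFT);
}

/* Operation: IOREGIONFD_CMD_READ or IOREGIONFD_CMD_WRITE. */
static inline unsigned info_cmd(uint32_t info)
{
    return info & INFO_CMD_MASK;
}

/* Access width in bytes: size code 0..3 maps to 1, 2, 4, 8. */
static inline unsigned info_size_bytes(uint32_t info)
{
    return 1u << ((info >> INFO_SIZE_SHIFT) & INFO_SIZE_MASK);
}

/* Only meaningful for writes: must a response be sent? */
static inline bool info_resp(uint32_t info)
{
    return (info >> INFO_RESP_SHIFT) & 1u;
}
```

For example, a 32-bit write that expects a response packs to 0x61
(0x40 | 0x20 | 0x01).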

The user_data field contains the opaque value provided to KVM_SET_IOREGION.
Applications can use this to uniquely identify the region that is being
accessed.

The offset field contains the byte offset being accessed within a region
that was registered with KVM_SET_IOREGION.

If the command is IOREGIONFD_CMD_WRITE then data contains the value
being written. The data value is a 64-bit integer in host endianness,
regardless of the access size.

The device sends responses by writing the following structure to the
file descriptor::

  struct ioregionfd_resp {
      __u64 data;
      __u8 pad[24];
  };

The data field contains the value read by an IOREGIONFD_CMD_READ
command. This field is zero for other commands. The data value is a 64-bit
integer in host endianness, regardless of the access size.
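
To make the command/response cycle concrete, here is a minimal device-side
sketch. The register bank (64-bit registers at 8-byte strides), the function
names device_process() and device_handle_one(), and the choice to ignore
unknown commands and out-of-range offsets are all assumptions made for
illustration. It also assumes a read always needs a response, which the text
implies but does not state.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct ioregionfd_cmd {
    uint32_t info;
    uint32_t padding;
    uint64_t user_data;
    uint64_t offset;
    uint64_t data;
};

struct ioregionfd_resp {
    uint64_t data;
    uint8_t  pad[24];
};

static_assert(sizeof(struct ioregionfd_cmd) == 32, "unexpected cmd size");
static_assert(sizeof(struct ioregionfd_resp) == 32, "unexpected resp size");

#define IOREGIONFD_CMD_READ  0
#define IOREGIONFD_CMD_WRITE 1

/*
 * Toy device model: apply one command to a bank of 64-bit registers.
 * A narrow write replaces the whole register with the masked value.
 * Returns 1 if a response must be sent, 0 otherwise.
 */
static inline int device_process(const struct ioregionfd_cmd *cmd,
                                 struct ioregionfd_resp *resp,
                                 uint64_t *regs, size_t nregs)
{
    unsigned op = cmd->info & 0xf;                      /* bits 3..0 */
    unsigned nbytes = 1u << ((cmd->info >> 4) & 0x3);   /* bits 5..4 */
    int resp_bit = (cmd->info >> 6) & 1;                /* bit 6 */
    uint64_t mask = nbytes == 8 ? UINT64_MAX : (1ull << (8 * nbytes)) - 1;
    uint64_t idx = cmd->offset / 8;

    memset(resp, 0, sizeof(*resp)); /* data stays zero for non-reads */

    if (op == IOREGIONFD_CMD_READ) {
        if (idx < nregs)
            resp->data = regs[idx] & mask;
        return 1; /* the vCPU is waiting for the value */
    }
    if (op == IOREGIONFD_CMD_WRITE) {
        if (idx < nregs)
            regs[idx] = cmd->data & mask;
        return resp_bit; /* posted writes clear the resp bit */
    }
    return 0; /* unknown command: ignored in this sketch */
}

/*
 * Read one command from fd, process it, and reply if required.
 * Returns 0 on success, -1 on a short or failed transfer.
 */
static inline int device_handle_one(int fd, uint64_t *regs, size_t nregs)
{
    struct ioregionfd_cmd cmd;
    struct ioregionfd_resp resp;
    ssize_t n;

    do {
        n = read(fd, &cmd, sizeof(cmd));
    } while (n < 0 && errno == EINTR);
    if (n != (ssize_t)sizeof(cmd))
        return -1;

    if (!device_process(&cmd, &resp, regs, nregs))
        return 0;

    do {
        n = write(fd, &resp, sizeof(resp));
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof(resp) ? 0 : -1;
}
```

A device task would call device_handle_one() in a loop on its ioregionfd.
Keeping the register logic in device_process() separate from the I/O makes it
easy to exercise without a real ioregionfd.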

Ordering
--------
Guest accesses are delivered in order, including posted writes.

Signals
-------
The vCPU task can be interrupted by a signal while waiting for an ioregionfd
response. In this case ioctl(KVM_RUN) returns -1 and errno is EINTR. Guest
entry is deferred until ioctl(KVM_RUN) is called again and the response has
been written to the ioregionfd.

Security
--------
Device emulation processes may be untrusted in multi-process VMM architectures.
Therefore the control plane and the data plane of ioregionfd are separate. A
task that only has access to an ioregionfd is unable to add/modify/remove
regions since that requires ioctls on a KVM vm fd. This ensures that device
emulation processes can only service MMIO/PIO accesses for regions that the VMM
registered on their behalf.

Multi-queue scalability
-----------------------
The protocol is synchronous - only one command/response cycle is in flight at a
time - but the vCPU will be blocked until the response has been processed
anyway. If another vCPU accesses an MMIO or PIO region belonging to the same
ioregionfd during this time then it waits for the first access to complete.

Per-queue ioregionfds can be set up to take advantage of concurrency on
multi-queue devices.
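
One possible per-queue setup is sketched below. The doorbell layout (one
4-byte doorbell per queue at base + i * stride), the use of the queue index as
user_data, and the helper name fill_queue_regions() are assumptions for
illustration and are not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the proposed layout; no released kernel header defines it. */
struct kvm_ioregion {
    uint64_t guest_paddr;
    uint64_t memory_size;
    uint64_t user_data;
    int32_t  fd;
    uint32_t flags;
    uint8_t  pad[32];
};

#define KVM_IOREGION_POSTED_WRITES (1u << 1)

static_assert(sizeof(struct kvm_ioregion) == 64, "unexpected struct padding");

/*
 * Hypothetical helper: give every queue its own region and its own
 * ioregionfd, so accesses to different queues do not serialize on one fd.
 */
static inline void fill_queue_regions(struct kvm_ioregion *out,
                                      const int32_t *queue_fds,
                                      unsigned nqueues,
                                      uint64_t base, uint64_t stride)
{
    for (unsigned i = 0; i < nqueues; i++) {
        memset(&out[i], 0, sizeof(out[i]));
        out[i].guest_paddr = base + (uint64_t)i * stride;
        out[i].memory_size = 4;            /* assumed doorbell width */
        out[i].user_data = i;              /* lets the device identify the queue */
        out[i].fd = queue_fds[i];          /* one ioregionfd per queue */
        out[i].flags = KVM_IOREGION_POSTED_WRITES;
    }
}
```

Each entry would then be registered with its own KVM_SET_IOREGION call, and
each fd handed to the thread or process that services that queue.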

Polling
-------
Userspace can poll ioregionfd by submitting an io_uring IORING_OP_READ request
and polling the completion queue (CQ) ring to detect when the read has
completed. Although this dispatch mechanism incurs more overhead than polling
directly on guest RAM, it captures each write access and supports reads.

Does it obsolete ioeventfd?
---------------------------
No. Although KVM_IOREGION_POSTED_WRITES offers somewhat similar functionality
to ioeventfd, there are differences. The datamatch functionality of ioeventfd
is not available and would need to be implemented by the device emulation
program. Due to the counter semantics of eventfds there is automatic coalescing
of repeated accesses with ioeventfd. Overall ioeventfd is lighter weight but
also more limited.
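
A device emulation program that wants datamatch-like behaviour could filter
commands itself, roughly as follows. The struct doorbell_filter type and the
doorbell_hit() function are hypothetical names for this sketch. Unlike
ioeventfd, nothing here coalesces repeated writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ioregionfd_cmd {
    uint32_t info;
    uint32_t padding;
    uint64_t user_data;
    uint64_t offset;
    uint64_t data;
};

static_assert(sizeof(struct ioregionfd_cmd) == 32, "unexpected cmd size");

#define IOREGIONFD_CMD_WRITE 1

/* Hypothetical userspace stand-in for an ioeventfd datamatch. */
struct doorbell_filter {
    uint64_t offset;     /* offset of the doorbell within the region */
    uint64_t datamatch;  /* value that must be written */
    bool     match_data; /* false: any written value triggers */
};

/* Returns true if cmd is a write that should ring this doorbell. */
static inline bool doorbell_hit(const struct doorbell_filter *f,
                                const struct ioregionfd_cmd *cmd)
{
    if ((cmd->info & 0xf) != IOREGIONFD_CMD_WRITE)
        return false;
    if (cmd->offset != f->offset)
        return false;
    return !f->match_data || cmd->data == f->datamatch;
}
```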

