* Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
@ 2020-02-22 20:19 Stefan Hajnoczi
  2020-02-24 16:14 ` Christophe de Dinechin
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-02-22 20:19 UTC (permalink / raw)
  To: kvm
  Cc: jasowang, mst, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier


Hi,
I wanted to share this idea with the KVM community and VMM developers.
If this isn't relevant to you but you know someone who should
participate, please feel free to add them :).

The following is an outline of "ioregionfd", a cross between ioeventfd
and KVM memory regions.  This mechanism would be helpful for VMMs that
emulate devices in separate processes, muser/VFIO, and to address
existing use cases that ioeventfd cannot handle.

Background
----------
There are currently two mechanisms for dispatching MMIO/PIO accesses in
KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN) and
ioeventfd.  Some VMMs also use polling to avoid dispatching
performance-critical MMIO/PIO accesses altogether.

These mechanisms have shortcomings for VMMs that perform device
emulation in separate processes (usually for increased security):

1. Only one process performs ioctl(KVM_RUN) for a vCPU, so that
   mechanism is not available to device emulation processes.

2. ioeventfd does not store the value written.  This makes it unsuitable
   for NVMe Submission Queue Tail Doorbell registers because the value
   written is needed by the device emulation process, for example.
   ioeventfd also does not support read operations.

3. Polling does not support computed read operations and only the latest
   value written is available to the device emulation process
   (intermediate values are overwritten if the guest performs multiple
   accesses).

Overview
--------
This proposal aims to address this gap through a wire protocol and a new
KVM API for registering MMIO/PIO regions that use this alternative
dispatch mechanism.

The KVM API is used by the VMM to set up dispatch.  The wire protocol is
used to dispatch accesses from KVM to the device emulation process.

This new MMIO/PIO dispatch mechanism eliminates the need to return from
ioctl(KVM_RUN) in the VMM and then exchange messages with a device
emulation process.

Inefficient dispatch to device processes today:

   kvm.ko  <---ioctl(KVM_RUN)---> VMM <---messages---> device

Direct dispatch with the new mechanism:

   kvm.ko  <---ioctl(KVM_RUN)---> VMM
     ^
     `---new MMIO/PIO mechanism-> device

Even single-process VMMs can take advantage of the new mechanism.  For
example, QEMU's emulated NVMe storage controller can implement IOThread
support.

No constraint is placed on the device process architecture.  A single
process could emulate all devices belonging to the guest, each device
could be its own process, or something in between.

Both ioeventfd and traditional KVM_EXIT_MMIO/KVM_EXIT_IO emulation
continue to work alongside the new mechanism, but only one of them is
used for any given guest address.

KVM API
-------
The following new KVM ioctl is added:

KVM_SET_IOREGIONFD
Capability: KVM_CAP_IOREGIONFD
Architectures: all
Type: vm ioctl
Parameters: struct kvm_ioregionfd (in)
Returns: 0 on success, !0 on error

This ioctl adds, modifies, or removes MMIO or PIO regions where guest
accesses are dispatched through a given file descriptor instead of
returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_IO.

struct kvm_ioregionfd {
    __u64 guest_physical_addr;
    __u64 memory_size; /* bytes */
    __s32 fd;
    __u32 region_id;
    __u32 flags;
    __u8  pad[36];
};

/* for kvm_ioregionfd::flags */
#define KVM_IOREGIONFD_PIO           (1u << 0)
#define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)

Regions are deleted by passing zero for memory_size.

MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.

The region_id is an opaque token that is included as part of the write
to the file descriptor.  It is typically a unique identifier for this
region but KVM does not interpret its value.

Both read and write guest accesses wait until an acknowledgement is
received on the file descriptor.  The KVM_IOREGIONFD_POSTED_WRITES flag
skips waiting for an acknowledgement on write accesses.  This is
suitable for accesses that do not require synchronous emulation, such as
doorbell register writes.
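
For illustration, a rough userspace sketch of registering a doorbell
region follows.  It assumes the struct and flag definitions above;
KVM_SET_IOREGIONFD has no ioctl number yet, so this will not compile
against current kernel headers.  Error handling is minimal:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>

  /* Returns the fd to hand to the device emulation process, or -1. */
  int register_doorbell(int vm_fd, uint64_t gpa, uint64_t size, uint32_t id)
  {
      int fds[2];
      struct kvm_ioregionfd region = {
          .guest_physical_addr = gpa,
          .memory_size         = size,
          .region_id           = id,
          /* doorbell semantics: the vCPU need not wait for a response */
          .flags               = KVM_IOREGIONFD_POSTED_WRITES,
      };

      /* any bidirectional fd works; a socketpair is the obvious choice */
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
          return -1;

      region.fd = fds[0];
      if (ioctl(vm_fd, KVM_SET_IOREGIONFD, &region) < 0)
          return -1;

      return fds[1]; /* the device emulation process reads commands here */
  }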

Wire protocol
-------------
The protocol spoken over the file descriptor is as follows.  The device
reads commands from the file descriptor with the following layout:

struct ioregionfd_cmd {
    __u32 info;
    __u32 region_id;
    __u64 addr;
    __u64 data;
    __u8 pad[8];
};

/* for ioregionfd_cmd::info */
#define IOREGIONFD_CMD_MASK 0xf
# define IOREGIONFD_CMD_READ 0
# define IOREGIONFD_CMD_WRITE 1
#define IOREGIONFD_SIZE_MASK 0x30
#define IOREGIONFD_SIZE_SHIFT 4
# define IOREGIONFD_SIZE_8BIT 0
# define IOREGIONFD_SIZE_16BIT 1
# define IOREGIONFD_SIZE_32BIT 2
# define IOREGIONFD_SIZE_64BIT 3
#define IOREGIONFD_NEED_PIO (1u << 6)
#define IOREGIONFD_NEED_RESPONSE (1u << 7)

The command is interpreted by inspecting the info field:

  switch (cmd.info & IOREGIONFD_CMD_MASK) {
  case IOREGIONFD_CMD_READ:
      /* It's a read access */
      break;
  case IOREGIONFD_CMD_WRITE:
      /* It's a write access */
      break;
  default:
      /* Protocol violation, terminate connection */
  }

The access size is interpreted by inspecting the info field:

  unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
  /* where nbytes = pow(2, size) */

The region_id indicates which MMIO/PIO region is being accessed.  This
field has no inherent structure but is typically a unique identifier.

The byte offset being accessed within that region is addr.

If the command is IOREGIONFD_CMD_WRITE then data contains the value
being written.

MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
accesses.

When IOREGIONFD_NEED_RESPONSE is set on a IOREGIONFD_CMD_WRITE command,
no response must be sent.  This flag has no effect for
IOREGIONFD_CMD_READ commands.

The device sends responses by writing the following structure to the
file descriptor:

struct ioregionfd_resp {
    __u64 data;
    __u32 info;
    __u8 pad[20];
};

/* for ioregionfd_resp::info */
#define IOREGIONFD_RESP_FAILED (1u << 0)

The info field is zero on success.  The IOREGIONFD_RESP_FAILED flag is
set on failure.

The data field contains the value read by an IOREGIONFD_CMD_READ
command.  This field is zero for other commands.
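
Putting the pieces together, a device emulation process might run a loop
along these lines.  This is only a sketch: it assumes the struct and
constant definitions above, and device_read()/device_write() are
placeholders for the actual register emulation:

  #include <string.h>
  #include <unistd.h>

  void ioregion_serve(int fd)
  {
      struct ioregionfd_cmd cmd;
      struct ioregionfd_resp resp;

      while (read(fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
          unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >>
                          IOREGIONFD_SIZE_SHIFT;
          unsigned nbytes = 1u << size; /* 1, 2, 4, or 8 bytes */

          memset(&resp, 0, sizeof(resp));

          switch (cmd.info & IOREGIONFD_CMD_MASK) {
          case IOREGIONFD_CMD_READ:
              resp.data = device_read(cmd.region_id, cmd.addr, nbytes);
              write(fd, &resp, sizeof(resp));
              break;
          case IOREGIONFD_CMD_WRITE:
              device_write(cmd.region_id, cmd.addr, nbytes, cmd.data);
              /* also send a response here when the write is not posted */
              break;
          default:
              return; /* protocol violation, terminate the connection */
          }
      }
  }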

Does it support polling?
------------------------
Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
file descriptor.  Poll the io_uring cq ring to detect when the read has
completed.
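
Roughly, with liburing (a sketch only; ring setup, error handling and the
command handling itself are omitted):

  #include <liburing.h>

  /* Submit one async read of the next command and busy-poll for completion. */
  int poll_next_cmd(struct io_uring *ring, int fd, struct ioregionfd_cmd *cmd)
  {
      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
      struct io_uring_cqe *cqe;
      int res;

      io_uring_prep_read(sqe, fd, cmd, sizeof(*cmd), 0);
      io_uring_submit(ring);

      /* poll the completion queue instead of blocking in read(2) */
      while (io_uring_peek_cqe(ring, &cqe) == -EAGAIN)
          ; /* a real implementation would bound or yield here */

      res = cqe->res;
      io_uring_cqe_seen(ring, cqe);
      return res; /* sizeof(*cmd) on success */
  }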

Although this dispatch mechanism incurs more overhead than polling
directly on guest RAM, it overcomes the limitations of polling: it
supports read accesses as well as capturing written values instead of
overwriting them.

Does it obsolete ioeventfd?
---------------------------
No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
functionality to ioeventfd, there are differences.  The datamatch
functionality of ioeventfd is not available and would need to be
implemented by the device emulation program.  Due to the counter
semantics of eventfds there is automatic coalescing of repeated accesses
with ioeventfd.  Overall ioeventfd is lighter weight but also more
limited.

How does it scale?
------------------
The protocol is synchronous - only one command/response cycle is in
flight at a time.  The vCPU will be blocked until the response has been
processed anyway.  If another vCPU accesses an MMIO or PIO region with
the same file descriptor during this time then it will wait too.

In practice this is not a problem since per-queue file descriptors can
be set up for multi-queue devices.

It is up to the device emulation program whether to handle multiple
devices over the same file descriptor or not.

What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
----------------------------------------------------------------------
Any file descriptor that supports bidirectional I/O would do.  This
rules out eventfds and pipes.  socketpair(AF_UNIX) is a likely
candidate.  Maybe a char device will be necessary for improved
performance.

Can this be part of KVM_SET_USER_MEMORY_REGION?
-----------------------------------------------
Maybe.  Perhaps everything can be squeezed into struct
kvm_userspace_memory_region but it's only worth doing if the memory
region code needs to be reused for this in the first place.  I'm not
sure.

What do you think?
------------------
I hope this serves as a starting point for improved MMIO/PIO dispatch in
KVM.  There are no immediate plans to implement this but I think it will
become necessary within the next year or two.

1. Does it meet your requirements?
2. Are there better alternatives?

Thanks,
Stefan


* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-22 20:19 Proposal for MMIO/PIO dispatch file descriptors (ioregionfd) Stefan Hajnoczi
@ 2020-02-24 16:14 ` Christophe de Dinechin
  2020-02-24 17:43   ` Stefan Hajnoczi
  2020-02-24 17:10 ` Michael S. Tsirkin
  2020-02-25 12:19 ` Felipe Franciosi
  2 siblings, 1 reply; 11+ messages in thread
From: Christophe de Dinechin @ 2020-02-24 16:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, jasowang, mst, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier


Stefan Hajnoczi writes:

> Hi,
> I wanted to share this idea with the KVM community and VMM developers.
> If this isn't relevant to you but you know someone who should
> participate, please feel free to add them :).
>
> The following is an outline of "ioregionfd", a cross between ioeventfd
> and KVM memory regions.  This mechanism would be helpful for VMMs that
> emulate devices in separate processes, muser/VFIO, and to address
> existing use cases that ioeventfd cannot handle.

Looks interesting.

> This ioctl adds, modifies, or removes MMIO or PIO regions where guest
> accesses are dispatched through a given file descriptor instead of
> returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO.

What file descriptors can be used for that? Is there an equivalent to
eventfd(2)? Your answer at the end of the mail seems to be that you could
use socketpair(AF_UNIX) or a char device. But it seems "weird" to me that
some arbitrary fd could have its behavior overridden by another process
doing this KVM ioctl.  Are there precedents for that kind of "fd takeover"
behavior?

>
> struct kvm_ioregionfd {
>     __u64 guest_physical_addr;
>     __u64 memory_size; /* bytes */
>     __s32 fd;
>     __u32 region_id;
>     __u32 flags;
>     __u8  pad[36];
> };
>
> /* for kvm_ioregionfd::flags */
> #define KVM_IOREGIONFD_PIO           (1u << 0)
> #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)
>
> Regions are deleted by passing zero for memory_size.

For delete and modify, this means you have to match on something.
Is that GPA only?

What should happen if you define or zero-size something that is in the
middle of a previously created region? I assume it's an error?

What about the fd being closed before/after you delete a region?

How small can the region be? Can it be just the size of a doorbell
register if all other registers for the device could be efficiently
implemented using memory writes?

>
> MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.

Just curious: what use case do you see for PIO? Isn't that detrimental
to your goal for this to be high-performance and cross-platform?


> The region_id is an opaque token that is included as part of the write
> to the file descriptor.  It is typically a unique identifier for this
> region but KVM does not interpret its value.
>
> Both read and write guest accesses wait until an acknowledgement is
> received on the file descriptor.

By "acknowledgement", do you mean data has been read or written on the
other side, or something else?

> The KVM_IOREGIONFD_POSTED_WRITES flag
> skips waiting for an acknowledgement on write accesses.  This is
> suitable for accesses that do not require synchronous emulation, such as
> doorbell register writes.
>
> Wire protocol
> -------------
> The protocol spoken over the file descriptor is as follows.  The device
> reads commands from the file descriptor with the following layout:
>
> struct ioregionfd_cmd {
>     __u32 info;
>     __u32 region_id;
>     __u64 addr;
>     __u64 data;
>     __u8 pad[8];
> };
>
> /* for ioregionfd_cmd::info */
> #define IOREGIONFD_CMD_MASK 0xf
> # define IOREGIONFD_CMD_READ 0
> # define IOREGIONFD_CMD_WRITE 1

Maybe "GUEST_READ" and "GUEST_WRITE"?


> #define IOREGIONFD_SIZE_MASK 0x30
> #define IOREGIONFD_SIZE_SHIFT 4
> # define IOREGIONFD_SIZE_8BIT 0
> # define IOREGIONFD_SIZE_16BIT 1
> # define IOREGIONFD_SIZE_32BIT 2
> # define IOREGIONFD_SIZE_64BIT 3
> #define IOREGIONFD_NEED_PIO (1u << 6)
> #define IOREGIONFD_NEED_RESPONSE (1u << 7)
>
> The command is interpreted by inspecting the info field:
>
>   switch (cmd.info & IOREGIONFD_CMD_MASK) {
>   case IOREGIONFD_CMD_READ:
>       /* It's a read access */
>       break;
>   case IOREGIONFD_CMD_WRITE:
>       /* It's a write access */
>       break;
>   default:
>       /* Protocol violation, terminate connection */
>   }
>
> The access size is interpreted by inspecting the info field:
>
>   unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
>   /* where nbytes = pow(2, size) */

What about providing a IOREGIONFD_SIZE(cmd) macro to do that?
>
> The region_id indicates which MMIO/PIO region is being accessed.  This
> field has no inherent structure but is typically a unique identifier.
>
> The byte offset being accessed within that region is addr.
>
> If the command is IOREGIONFD_CMD_WRITE then data contains the value
> being written.

I assume that if the guest writes an 8-bit 42, data contains a 64-bit 42
irrespective of guest and host endianness.

>
> MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
> accesses.
>
> When IOREGIONFD_NEED_RESPONSE is set on a IOREGIONFD_CMD_WRITE command,
> no response must be sent.  This flag has no effect for
> IOREGIONFD_CMD_READ commands.

I find this paragraph confusing. "NEED_RESPONSE" seems to imply the
response must be sent. Typo? Or do I misunderstand who is supposed to
send the response?

Could you clarify the reason for having both POSTED_WRITES and NEED_RESPONSE?

>
> The device sends responses by writing the following structure to the
> file descriptor:
>
> struct ioregionfd_resp {
>     __u64 data;
>     __u32 info;
>     __u8 pad[20];
> };

I know you manually optimized the intra-struct padding here, but do we
rule out 128-bit data forever? :-)

>
> /* for ioregionfd_resp::info */
> #define IOREGIONFD_RESP_FAILED (1u << 0)

What happens when FAILED is set?
- If the guest still reads data, then how does it know read failed?
- Otherwise, what happens?

I understand the intent is for the resp to come in the same order as
the cmd. Is it OK for the same region to be accessed by different vCPUs?
If so, where do you keep the information about the vCPU that did a cmd
in order to be able to dispatch the resp back to the vCPU that initiated
the operation? [The answer below seems to imply you don't and just
block the second vCPU in that case]

>
> The info field is zer oon success.

typo "zero on"

> The IOREGIONFD_RESP_FAILED flag is set on failure.

The device sets it (active voice), or are there other conditions where
it can be set (maybe state of the fd)?

>
> The data field contains the value read by an IOREGIONFD_CMD_READ
> command.  This field is zero for other commands.
>
> Does it support polling?
> ------------------------
> Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
> file descriptor.  Poll the io_uring cq ring to detect when the read has
> completed.
>
> Although this dispatch mechanism incurs more overhead than polling
> directly on guest RAM, it overcomes the limitations of polling: it
> supports read accesses as well as capturing written values instead of
> overwriting them.
>
> Does it obsolete ioeventfd?
> ---------------------------
> No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
> functionality to ioeventfd, there are differences.  The datamatch
> functionality of ioeventfd is not available and would need to be
> implemented by the device emulation program.  Due to the counter
> semantics of eventfds there is automatic coalescing of repeated accesses
> with ioeventfd.  Overall ioeventfd is lighter weight but also more
> limited.
>
> How does it scale?
> ------------------
> The protocol is synchronous - only one command/response cycle is in
> flight at a time.  The vCPU will be blocked until the response has been
> processed anyway.  If another vCPU accesses an MMIO or PIO region with
> the same file descriptor during this time then it will wait to.
>
> In practice this is not a problem since per-queue file descriptors can
> be set up for multi-queue devices.

Can a guest write be blocked if user-space is slow reading the fd?

What about a guest read? Since the vCPU is blocked anyway, could you
elaborate how the proposed switch to user-space improves relative to the
existing one? Seems like a possible win if you have some free CPU that
can pick up the user-space. If you need to steal a running CPU for your
user-space, it's less clear to me that there is a win (limit case being
a single host CPU where you'd just ping-pong between processes).

>
> It is up to the device emulation program whether to handle multiple
> devices over the same file descriptor or not.
>
> What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
> ----------------------------------------------------------------------
> Any file descriptor that supports bidirectional I/O would do.  This
> rules out eventfds and pipes.  socketpair(AF_UNIX) is a likely
> candidate.  Maybe a char device will be necessary for improved
> performance.
>
> Can this be part of KVM_SET_USER_MEMORY_REGION?
> -----------------------------------------------
> Maybe.  Perhaps everything can be squeezed into struct
> kvm_userspace_memory_region but it's only worth doing if the memory
> region code needs to be reused for this in the first place.  I'm not
> sure.
>
> What do you think?
> ------------------
> I hope this serves as a starting point for improved MMIO/PIO dispatch in
> KVM.  There are no immediate plans to implement this but I think it will
> become necessary within the next year or two.
>
> 1. Does it meet your requirements?
> 2. Are there better alternatives?

For doorbell-style writes when you can make it non-blocking, it seems
like a nice win to me. I don't know if there are cases where reads would
dominate. If so, we may need to think a bit more about the read path.
In all cases, I personally find the idea interesting and potentially
useful.

--
Cheers,
Christophe de Dinechin (IRC c3d)



* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-22 20:19 Proposal for MMIO/PIO dispatch file descriptors (ioregionfd) Stefan Hajnoczi
  2020-02-24 16:14 ` Christophe de Dinechin
@ 2020-02-24 17:10 ` Michael S. Tsirkin
  2020-02-25  9:24   ` Stefan Hajnoczi
  2020-02-25 12:19 ` Felipe Franciosi
  2 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2020-02-24 17:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, jasowang, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier

On Sat, Feb 22, 2020 at 08:19:16PM +0000, Stefan Hajnoczi wrote:
> The KVM_IOREGIONFD_POSTED_WRITES flag
> skips waiting for an acknowledgement on write accesses.  This is
> suitable for accesses that do not require synchronous emulation, such as
> doorbell register writes.

I would avoid hacks like this until we understand this better.
Specifically, one needs to be very careful since memory ordering semantics
can differ between a write into an uncacheable range and host writes into
a data structure. Reads from one region are also assumed to be ordered with
writes to another region, and drivers are known to make assumptions
like this.

Memory ordering being what it is, this isn't a field where I'd be
comfortable assuming device writers know what they are doing.

-- 
MST



* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-24 16:14 ` Christophe de Dinechin
@ 2020-02-24 17:43   ` Stefan Hajnoczi
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-02-24 17:43 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: kvm, jasowang, mst, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier


On Mon, Feb 24, 2020 at 05:14:25PM +0100, Christophe de Dinechin wrote:
> 
> Stefan Hajnoczi writes:
> 
> > Hi,
> > I wanted to share this idea with the KVM community and VMM developers.
> > If this isn't relevant to you but you know someone who should
> > participate, please feel free to add them :).
> >
> > The following is an outline of "ioregionfd", a cross between ioeventfd
> > and KVM memory regions.  This mechanism would be helpful for VMMs that
> > emulate devices in separate processes, muser/VFIO, and to address
> > existing use cases that ioeventfd cannot handle.
> 
> Looks interesting.
> 
> > This ioctl adds, modifies, or removes MMIO or PIO regions where guest
> > accesses are dispatched through a given file descriptor instead of
> > returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO.
> 
> What file descriptors can be used for that? Is there an equivalent to
> eventfd(2)? You answer at end of mail seems to be that you could use
> socketpair(AF_UNIX) or a char device. But it seems "weird" to me that
> some arbitrary fd could have its behavior overriden by another process
> doing this KVM ioctl.  Are there precedents for that kind of "fd takeover"
> behavior?

Yes, one example is userspace providing a TCP/IP socket to the NBD
kernel driver.

Think of it as asking the kernel to do read(2)/write(2) on an fd on
behalf of the process.
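
(In the multi-process case the VMM would typically create the socketpair
itself and hand the device emulation end over with SCM_RIGHTS on an
existing UNIX socket.  A standard sketch, the function name is mine:)

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Pass ioregion_fd to the device emulation process over sock. */
  int send_ioregion_fd(int sock, int ioregion_fd)
  {
      char ctrl[CMSG_SPACE(sizeof(int))];
      char byte = 0;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      struct msghdr msg = {
          .msg_iov        = &iov,
          .msg_iovlen     = 1,
          .msg_control    = ctrl,
          .msg_controllen = sizeof(ctrl),
      };
      struct cmsghdr *cmsg;

      memset(ctrl, 0, sizeof(ctrl));
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type  = SCM_RIGHTS;
      cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &ioregion_fd, sizeof(int));

      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }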

> >
> > struct kvm_ioregionfd {
> >     __u64 guest_physical_addr;
> >     __u64 memory_size; /* bytes */
> >     __s32 fd;
> >     __u32 region_id;
> >     __u32 flags;
> >     __u8  pad[36];
> > };
> >
> > /* for kvm_ioregionfd::flags */
> > #define KVM_IOREGIONFD_PIO           (1u << 0)
> > #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)
> >
> > Regions are deleted by passing zero for memory_size.
> 
> For delete and modify, this means you have to match on something.
> Is that GPA only?
> 
> What should happen if you define or zero-size something that is in the
> middle of a previously created region? I assume it's an error?

The answer to these should mirror KVM_SET_USER_MEMORY_REGION unless
there is a strong reason to behave differently.

> What about the fd being closed before/after you delete a region?

The kernel will fget() the struct file while in use to prevent it from
being deleted.

Ownership of the fd belongs to userspace, so userspace must close the fd
after deleting it from KVM.  This is the same as with KVM_IOEVENTFD.

> How small can the region be? Can it be just the size of a doorbell
> register if all other registers for the device could be efficiently
> implemented using memory writes?

The minimum size is 1 byte.

The recommended way of using this API is one region per QEMU
MemoryRegion or VFIO struct vfio_region_info.  Providing finer-grained
regions to KVM is only useful if they differ in the flags.

> >
> > MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.
> 
> Just curious: what use case do you see for PIO? Isn't that detrimental
> to your goal for this to be high-performance and cross-platform?

PCI devices have I/O Space BARs so there must be a way to support them.

> > The region_id is an opaque token that is included as part of the write
> > to the file descriptor.  It is typically a unique identifier for this
> > region but KVM does not interpret its value.
> >
> > Both read and write guest accesses wait until an acknowledgement is
> > received on the file descriptor.
> 
> By "acknowledgement", do you mean data has been read or written on the
> other side, or something else?

The response (struct ioregionfd_resp) has been received.

> > The KVM_IOREGIONFD_POSTED_WRITES flag
> > skips waiting for an acknowledgement on write accesses.  This is
> > suitable for accesses that do not require synchronous emulation, such as
> > doorbell register writes.
> >
> > Wire protocol
> > -------------
> > The protocol spoken over the file descriptor is as follows.  The device
> > reads commands from the file descriptor with the following layout:
> >
> > struct ioregionfd_cmd {
> >     __u32 info;
> >     __u32 region_id;
> >     __u64 addr;
> >     __u64 data;
> >     __u8 pad[8];
> > };
> >
> > /* for ioregionfd_cmd::info */
> > #define IOREGIONFD_CMD_MASK 0xf
> > # define IOREGIONFD_CMD_READ 0
> > # define IOREGIONFD_CMD_WRITE 1
> 
> Maybe "GUEST_READ" and "GUEST_WRITE"?

There are use cases beyond virtualization, like testing or maybe
a "vfio-user-loopback" device.  Let's avoid the term "guest" for the
wire protocol (obviously it's fine when talking about the KVM API).

> > #define IOREGIONFD_SIZE_MASK 0x30
> > #define IOREGIONFD_SIZE_SHIFT 4
> > # define IOREGIONFD_SIZE_8BIT 0
> > # define IOREGIONFD_SIZE_16BIT 1
> > # define IOREGIONFD_SIZE_32BIT 2
> > # define IOREGIONFD_SIZE_64BIT 3
> > #define IOREGIONFD_NEED_PIO (1u << 6)
> > #define IOREGIONFD_NEED_RESPONSE (1u << 7)
> >
> > The command is interpreted by inspecting the info field:
> >
> >   switch (cmd.info & IOREGIONFD_CMD_MASK) {
> >   case IOREGIONFD_CMD_READ:
> >       /* It's a read access */
> >       break;
> >   case IOREGIONFD_CMD_WRITE:
> >       /* It's a write access */
> >       break;
> >   default:
> >       /* Protocol violation, terminate connection */
> >   }
> >
> > The access size is interpreted by inspecting the info field:
> >
> >   unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
> >   /* where nbytes = pow(2, size) */
> 
> What about providing a IOREGIONFD_SIZE(cmd) macro to do that?

Good idea.
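
Something like this, presumably (not part of the proposal yet):

  #define IOREGIONFD_SIZE(cmd) \
      (((cmd).info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT)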

> >
> > The region_id indicates which MMIO/PIO region is being accessed.  This
> > field has no inherent structure but is typically a unique identifier.
> >
> > The byte offset being accessed within that region is addr.
> >
> > If the command is IOREGIONFD_CMD_WRITE then data contains the value
> > being written.
> 
> I assume if the guest writes a 8-bit 42, data contains a 64-bit 42
> irrespective of guest and host endianness.

Yes, the data field is native-endian.

> >
> > MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
> > accesses.
> >
> > When IOREGIONFD_NEED_RESPONSE is set on a IOREGIONFD_CMD_WRITE command,
> > no response must be sent.  This flag has no effect for
> > IOREGIONFD_CMD_READ commands.
> 
> I find this paragraph confusing. "NEED_RESPONSE" seems to imply the
> response must be sent. Typo? Or do I misunderstand who is supposed to
> send the response?

This was a typo.  It should be "NO_RESPONSE" :).

> Could you clarify the reason for having both POSTED_WRITES and NEED_RESPONSE?

The NO_RESPONSE bit will be set in struct ioregionfd_cmd when the region
has the POSTED_WRITES flag.

We could eliminate this flag from the wire protocol and assume that the
device emulation program knows that certain writes do not have a
response, but it's more flexible to include it.

Also, commands added to the wire protocol in the future might also want
to skip the response, so I think a general-purpose "NO_RESPONSE" name is
better than calling it "POSTED_WRITES" at the wire protocol level.

> >
> > The device sends responses by writing the following structure to the
> > file descriptor:
> >
> > struct ioregionfd_resp {
> >     __u64 data;
> >     __u32 info;
> >     __u8 pad[20];
> > };
> 
> I know you manually optimized for intra-padding here, but do we rule
> 128-bit data forever? :-)

Yeah, I think so :).

> 
> >
> > /* for ioregionfd_resp::info */
> > #define IOREGIONFD_RESP_FAILED (1u << 0)
> 
> What happens when FAILED is set?
> - If the guest still reads data, then how does it know read failed?
> - Otherwise, what happens?

This is a good question.  I don't have a detailed list of errors and how
they would be handled by KVM yet.

> I understand the intent is for the resp to come in the same order as
> the cmd. Is it OK for the same region to be accessed by different vCPUs?
> If so, where do you keep the information about the vCPU that did a cmd
> in order to be able to dispatch the resp back to the vCPU that initiated
> the operation? [Answer below seems that to imply you don't and just
> block the second vCPU in that case]

Yep, the second vCPU waits until the first one is done.

> >
> > The info field is zer oon success.
> 
> typo "zero on"

Thanks!

> 
> > The IOREGIONFD_RESP_FAILED flag is set on failure.
> 
> The device sets it (active voice), or are there other conditions where
> it can be set (maybe state of the fd)?

No other conditions (yet?).

> >
> > The data field contains the value read by an IOREGIONFD_CMD_READ
> > command.  This field is zero for other commands.
> >
> > Does it support polling?
> > ------------------------
> > Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
> > file descriptor.  Poll the io_uring cq ring to detect when the read has
> > completed.
> >
> > Although this dispatch mechanism incurs more overhead than polling
> > directly on guest RAM, it overcomes the limitations of polling: it
> > supports read accesses as well as capturing written values instead of
> > overwriting them.
> >
> > Does it obsolete ioeventfd?
> > ---------------------------
> > No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
> > functionality to ioeventfd, there are differences.  The datamatch
> > functionality of ioeventfd is not available and would need to be
> > implemented by the device emulation program.  Due to the counter
> > semantics of eventfds there is automatic coalescing of repeated accesses
> > with ioeventfd.  Overall ioeventfd is lighter weight but also more
> > limited.
> >
> > How does it scale?
> > ------------------
> > The protocol is synchronous - only one command/response cycle is in
> > flight at a time.  The vCPU will be blocked until the response has been
> > processed anyway.  If another vCPU accesses an MMIO or PIO region with
> > the same file descriptor during this time then it will wait to.
> >
> > In practice this is not a problem since per-queue file descriptors can
> > be set up for multi-queue devices.
> 
> Can a guest write be blocked if user-space is slow reading the fd?

Yes.  vmexits block the vCPU.

POSTED_WRITES avoids this, but it can only be used when the register
semantics allow it (e.g. doorbell registers).  Also, if the fd
write(2) blocks (the socket sndbuf is full) then even a POSTED_WRITES
vCPU blocks.

> What about a guest read? Since the vCPU is blocked anyway, could you
> elaborate how the proposed switch to user-space improves relative to the
> existing one? Seems like a possible win if you have some free CPU that
> can pick up the user-space. If you need to steal a running CPU for your
> user-space, it's less clear to me that there is a win (limit case being
> a single host CPU where you'd just ping-pong between processes).

Today ioctl(KVM_RUN) exits to QEMU, which then has to forward the access
to another process/thread.  That's 2 wakeups.

With ioregionfd the access is directly handled by the device emulator
process.  That's 1 wakeup.

Plus the response needs to make the trip back.

Stefan


* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-24 17:10 ` Michael S. Tsirkin
@ 2020-02-25  9:24   ` Stefan Hajnoczi
  2020-02-25 11:57     ` Michael S. Tsirkin
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-02-25  9:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, jasowang, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier


On Mon, Feb 24, 2020 at 12:10:25PM -0500, Michael S. Tsirkin wrote:
> On Sat, Feb 22, 2020 at 08:19:16PM +0000, Stefan Hajnoczi wrote:
> > The KVM_IOREGIONFD_POSTED_WRITES flag
> > skips waiting for an acknowledgement on write accesses.  This is
> > suitable for accesses that do not require synchronous emulation, such as
> > doorbell register writes.
> 
> I would avoid hacks like this until we understand this better.
> Specificlly one needs to be very careful since memory ordering semantics
> can differ between a write into an uncacheable range and host writes into
> a data structure. Reads from one region are also assumed to be ordered with
> writes to another region, and drivers are known to make assumptions
> like this.
> 
> Memory ordering being what it is, this isn't a field I'd be comfortable
> device writes know what they are doing.

Unlike PCI Posted Writes the idea is not to let the write operations sit
in a cache.  They will be sent immediately just like ioeventfd is
signalled immediately before re-entering the guest.

The purpose of this feature is to let the device emulation program
handle these writes asynchronously (without holding up the vCPU for a
response from the device emulation program) but the order of
reads/writes remains unchanged.

Stefan


* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-25  9:24   ` Stefan Hajnoczi
@ 2020-02-25 11:57     ` Michael S. Tsirkin
  2020-02-25 16:36       ` Stefan Hajnoczi
  0 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2020-02-25 11:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, jasowang, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier

On Tue, Feb 25, 2020 at 09:24:34AM +0000, Stefan Hajnoczi wrote:
> On Mon, Feb 24, 2020 at 12:10:25PM -0500, Michael S. Tsirkin wrote:
> > On Sat, Feb 22, 2020 at 08:19:16PM +0000, Stefan Hajnoczi wrote:
> > > The KVM_IOREGIONFD_POSTED_WRITES flag
> > > skips waiting for an acknowledgement on write accesses.  This is
> > > suitable for accesses that do not require synchronous emulation, such as
> > > doorbell register writes.
> > 
> > I would avoid hacks like this until we understand this better.
> > Specificlly one needs to be very careful since memory ordering semantics
> > can differ between a write into an uncacheable range and host writes into
> > a data structure. Reads from one region are also assumed to be ordered with
> > writes to another region, and drivers are known to make assumptions
> > like this.
> > 
> > Memory ordering being what it is, this isn't a field I'd be comfortable
> > device writes know what they are doing.
> 
> Unlike PCI Posted Writes the idea is not to let the write operations sit
> in a cache.  They will be sent immediately just like ioeventfd is
> signalled immediately before re-entering the guest.

But with ioeventfd the write does sit in a cache: the eventfd's internal
counter.  The fact that it's signalled does not force a barrier on the
signalling thread.  It looks like the same happens here: the value is
saved with the file descriptor, and other accesses to the same device can
bypass the write.

> The purpose of this feature is to let the device emulation program
> handle these writes asynchronously (without holding up the vCPU for a
> response from the device emulation program) but the order of
> reads/writes remains unchanged.
> 
> Stefan

I don't see how this can be implemented without guest changes though.
For example, how do you make sure two writes to such regions are ordered
just like they are on PCI?

-- 
MST



* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-22 20:19 Proposal for MMIO/PIO dispatch file descriptors (ioregionfd) Stefan Hajnoczi
  2020-02-24 16:14 ` Christophe de Dinechin
  2020-02-24 17:10 ` Michael S. Tsirkin
@ 2020-02-25 12:19 ` Felipe Franciosi
  2020-02-25 12:30   ` Felipe Franciosi
  2020-02-25 16:45   ` Stefan Hajnoczi
  2 siblings, 2 replies; 11+ messages in thread
From: Felipe Franciosi @ 2020-02-25 12:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, jasowang, Michael S. Tsirkin, Cornelia Huck, slp,
	John G Johnson, robert.bradford, Dan Horobeanu, Stephen Barber,
	Peter Shier, mlevitsk

Hi,

This looks amazing, Stefan. The lack of such a mechanism troubled us
during the development of MUSER and resulted in the slow-path we have
today for MMIO with register semantics (when writes cannot be
overwritten before the device emulator has a chance to process them).

I have added some comments inline, but wanted to first link your
proposal with an idea that I discussed with Maxim Levitsky back in
Lyon and evolve on it a little bit. IIRC/IIUC Maxim was keen on a VT-x
extension where a CPU could IPI another to handle events which would
normally cause a VMEXIT. That is probably more applicable to the
standard ioeventfd model, but it got me thinking about PML.

Bear with me. :)

In addition to an fd, which could be used for notifications only, the
wire protocol could append "struct ioregionfd_cmd"s (probably renamed
to drop "fd") to one or more pages (perhaps a ring buffer of sorts).

That would only work for writes; reads would still be synchronous.

The device emulator therefore doesn't have to respond to each write
command. It could process the whole lot whenever it gets around to it.
Most importantly (and linking back to the VT-x extension idea), maybe
we can avoid the VMEXIT altogether if the CPU could take care of
appending writes to that buffer. Thoughts?

> On Feb 22, 2020, at 8:19 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> Hi,
> I wanted to share this idea with the KVM community and VMM developers.
> If this isn't relevant to you but you know someone who should
> participate, please feel free to add them :).
> 
> The following is an outline of "ioregionfd", a cross between ioeventfd
> and KVM memory regions.  This mechanism would be helpful for VMMs that
> emulate devices in separate processes, muser/VFIO, and to address
> existing use cases that ioeventfd cannot handle.
> 
> Background
> ----------
> There are currently two mechanisms for dispatching MMIO/PIO accesses in
> KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN) and
> ioeventfd.  Some VMMs also use polling to avoid dispatching
> performance-critical MMIO/PIO accesses altogether.
> 
> These mechanisms have shortcomings for VMMs that perform device
> emulation in separate processes (usually for increased security):
> 
> 1. Only one process performs ioctl(KVM_RUN) for a vCPU, so that
>   mechanism is not available to device emulation processes.
> 
> 2. ioeventfd does not store the value written.  This makes it unsuitable
>   for NVMe Submission Queue Tail Doorbell registers because the value
>   written is needed by the device emulation process, for example.
>   ioeventfd also does not support read operations.
> 
> 3. Polling does not support computed read operations and only the latest
>   value written is available to the device emulation process
>   (intermediate values are overwritten if the guest performs multiple
>   accesses).
> 
> Overview
> --------
> This proposal aims to address this gap through a wire protocol and a new
> KVM API for registering MMIO/PIO regions that use this alternative
> dispatch mechanism.
> 
> The KVM API is used by the VMM to set up dispatch.  The wire protocol is
> used to dispatch accesses from KVM to the device emulation process.
> 
> This new MMIO/PIO dispatch mechanism eliminates the need to return from
> ioctl(KVM_RUN) in the VMM and then exchange messages with a device
> emulation process.
> 
> Inefficient dispatch to device processes today:
> 
>   kvm.ko  <---ioctl(KVM_RUN)---> VMM <---messages---> device
> 
> Direct dispatch with the new mechanism:
> 
>   kvm.ko  <---ioctl(KVM_RUN)---> VMM
>     ^
>     `---new MMIO/PIO mechanism-> device
> 
> Even single-process VMMs can take advantage of the new mechanism.  For
> example, QEMU's emulated NVMe storage controller can implement IOThread
> support.
> 
> No constraint is placed on the device process architecture.  A single
> process could emulate all devices belonging to the guest, each device
> could be its own process, or something in between.
> 
> Both ioeventfd and traditional KVM_EXIT_MMIO/KVM_EXIT_IO emulation
> continue to work alongside the new mechanism, but only one of them is
> used for any given guest address.
> 
> KVM API
> -------
> The following new KVM ioctl is added:
> 
> KVM_SET_IOREGIONFD
> Capability: KVM_CAP_IOREGIONFD
> Architectures: all
> Type: vm ioctl
> Parameters: struct kvm_ioregionfd (in)
> Returns: 0 on success, !0 on error
> 
> This ioctl adds, modifies, or removes MMIO or PIO regions where guest
> accesses are dispatched through a given file descriptor instead of
> returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO.
> 
> struct kvm_ioregionfd {
>    __u64 guest_physical_addr;
>    __u64 memory_size; /* bytes */
>    __s32 fd;
>    __u32 region_id;
>    __u32 flags;
>    __u8  pad[36];
> };
> 
> /* for kvm_ioregionfd::flags */
> #define KVM_IOREGIONFD_PIO           (1u << 0)
> #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)
> 
> Regions are deleted by passing zero for memory_size.
> 
> MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.
> 
> The region_id is an opaque token that is included as part of the write
> to the file descriptor.  It is typically a unique identifier for this
> region but KVM does not interpret its value.
> 
> Both read and write guest accesses wait until an acknowledgement is
> received on the file descriptor.  The KVM_IOREGIONFD_POSTED_WRITES flag
> skips waiting for an acknowledgement on write accesses.  This is
> suitable for accesses that do not require synchronous emulation, such as
> doorbell register writes.
> 
> Wire protocol
> -------------
> The protocol spoken over the file descriptor is as follows.  The device
> reads commands from the file descriptor with the following layout:
> 
> struct ioregionfd_cmd {
>    __u32 info;
>    __u32 region_id;
>    __u64 addr;
>    __u64 data;
>    __u8 pad[8];
> };
> 
> /* for ioregionfd_cmd::info */
> #define IOREGIONFD_CMD_MASK 0xf
> # define IOREGIONFD_CMD_READ 0
> # define IOREGIONFD_CMD_WRITE 1

Why do we need 4 bits for this? I appreciate you want to align the
next field, but there's SIZE_SHIFT for that; you could have CMD_MASK
set to 0x1 unless I'm missing something. The reserved space could be
used for something else in the future.

> #define IOREGIONFD_SIZE_MASK 0x30
> #define IOREGIONFD_SIZE_SHIFT 4
> # define IOREGIONFD_SIZE_8BIT 0
> # define IOREGIONFD_SIZE_16BIT 1
> # define IOREGIONFD_SIZE_32BIT 2
> # define IOREGIONFD_SIZE_64BIT 3

Christophe already asked about the 64-bit limit. I think that's fine,
and am assuming that if larger accesses are ever needed they can just
be split in two commands by KVM?

> #define IOREGIONFD_NEED_PIO (1u << 6)
> #define IOREGIONFD_NEED_RESPONSE (1u << 7)
> 
> The command is interpreted by inspecting the info field:
> 
>  switch (cmd.info & IOREGIONFD_CMD_MASK) {
>  case IOREGIONFD_CMD_READ:
>      /* It's a read access */
>      break;
>  case IOREGIONFD_CMD_WRITE:
>      /* It's a write access */
>      break;
>  default:
>      /* Protocol violation, terminate connection */
>  }
> 
> The access size is interpreted by inspecting the info field:
> 
>  unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
>  /* where nbytes = pow(2, size) */
> 
> The region_id indicates which MMIO/PIO region is being accessed.  This
> field has no inherent structure but is typically a unique identifier.
> 
> The byte offset being accessed within that region is addr.

It's not clear to me if addr is GPA absolute or an offset. Sounds like
the latter, in which case isn't it preferable to name this "offset"?

> 
> If the command is IOREGIONFD_CMD_WRITE then data contains the value
> being written.
> 
> MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
> accesses.
> 
> When IOREGIONFD_NEED_RESPONSE is set on a IOREGIONFD_CMD_WRITE command,
> no response must be sent.  This flag has no effect for
> IOREGIONFD_CMD_READ commands.

Christophe already flagged this, too. :)

That's all I had for now.

F.

> 
> The device sends responses by writing the following structure to the
> file descriptor:
> 
> struct ioregionfd_resp {
>    __u64 data;
>    __u32 info;
>    __u8 pad[20];
> };
> 
> /* for ioregionfd_resp::info */
> #define IOREGIONFD_RESP_FAILED (1u << 0)
> 
> The info field is zero on success.  The IOREGIONFD_RESP_FAILED flag is
> set on failure.
> 
> The data field contains the value read by an IOREGIONFD_CMD_READ
> command.  This field is zero for other commands.
> 
> Does it support polling?
> ------------------------
> Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
> file descriptor.  Poll the io_uring cq ring to detect when the read has
> completed.
> 
> Although this dispatch mechanism incurs more overhead than polling
> directly on guest RAM, it overcomes the limitations of polling: it
> supports read accesses as well as capturing written values instead of
> overwriting them.
> 
> Does it obsolete ioeventfd?
> ---------------------------
> No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
> functionality to ioeventfd, there are differences.  The datamatch
> functionality of ioeventfd is not available and would need to be
> implemented by the device emulation program.  Due to the counter
> semantics of eventfds there is automatic coalescing of repeated accesses
> with ioeventfd.  Overall ioeventfd is lighter weight but also more
> limited.
> 
> How does it scale?
> ------------------
> The protocol is synchronous - only one command/response cycle is in
> flight at a time.  The vCPU will be blocked until the response has been
> processed anyway.  If another vCPU accesses an MMIO or PIO region with
> the same file descriptor during this time then it will wait to.
> 
> In practice this is not a problem since per-queue file descriptors can
> be set up for multi-queue devices.
> 
> It is up to the device emulation program whether to handle multiple
> devices over the same file descriptor or not.
> 
> What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
> ----------------------------------------------------------------------
> Any file descriptor that supports bidirectional I/O would do.  This
> rules out eventfds and pipes.  socketpair(AF_UNIX) is a likely
> candidate.  Maybe a char device will be necessary for improved
> performance.
> 
> Can this be part of KVM_SET_USER_MEMORY_REGION?
> -----------------------------------------------
> Maybe.  Perhaps everything can be squeezed into struct
> kvm_userspace_memory_region but it's only worth doing if the memory
> region code needs to be reused for this in the first place.  I'm not
> sure.
> 
> What do you think?
> ------------------
> I hope this serves as a starting point for improved MMIO/PIO dispatch in
> KVM.  There are no immediate plans to implement this but I think it will
> become necessary within the next year or two.
> 
> 1. Does it meet your requirements?
> 2. Are there better alternatives?
> 
> Thanks,
> Stefan



* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-25 12:19 ` Felipe Franciosi
@ 2020-02-25 12:30   ` Felipe Franciosi
  2020-02-25 16:47     ` Stefan Hajnoczi
  2020-02-25 16:45   ` Stefan Hajnoczi
  1 sibling, 1 reply; 11+ messages in thread
From: Felipe Franciosi @ 2020-02-25 12:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, jasowang, Michael S. Tsirkin, Cornelia Huck, slp,
	John G Johnson, robert.bradford, Dan Horobeanu, Stephen Barber,
	Peter Shier, mlevitsk

Ah, I forgot to ask a few other questions.

> On Feb 25, 2020, at 12:19 PM, Felipe Franciosi <felipe@nutanix.com> wrote:
> 
> Hi,
> 
> This looks amazing, Stefan. The lack of such a mechanism troubled us
> during the development of MUSER and resulted in the slow-path we have
> today for MMIO with register semantics (when writes cannot be
> overwritten before the device emulator has a chance to process them).
> 
> I have added some comments inline, but wanted to first link your
> proposal with an idea that I discussed with Maxim Levitsky back in
> Lyon and evolve on it a little bit. IIRC/IIUC Maxim was keen on a VT-x
> extension where a CPU could IPI another to handle events which would
> normally cause a VMEXIT. That is probably more applicable to the
> standard ioeventfd model, but it got me thinking about PML.
> 
> Bear with me. :)
> 
> In addition to an fd, which could be used for notifications only, the
> wire protocol could append "struct ioregionfd_cmd"s (probably renamed
> to drop "fd") to one or more pages (perhaps a ring buffer of sorts).
> 
> That would only work for writes; reads would still be synchronous.
> 
> The device emulator therefore doesn't have to respond to each write
> command. It could process the whole lot whenever it gets around to it.
> Most importantly (and linking back to the VT-x extension idea), maybe
> we can avoid the VMEXIT altogether if the CPU could take care of
> appending writes to that buffer. Thoughts?
> 
>> On Feb 22, 2020, at 8:19 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> 
>> Hi,
>> I wanted to share this idea with the KVM community and VMM developers.
>> If this isn't relevant to you but you know someone who should
>> participate, please feel free to add them :).
>> 
>> The following is an outline of "ioregionfd", a cross between ioeventfd
>> and KVM memory regions.  This mechanism would be helpful for VMMs that
>> emulate devices in separate processes, muser/VFIO, and to address
>> existing use cases that ioeventfd cannot handle.
>> 
>> Background
>> ----------
>> There are currently two mechanisms for dispatching MMIO/PIO accesses in
>> KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN) and
>> ioeventfd.  Some VMMs also use polling to avoid dispatching
>> performance-critical MMIO/PIO accesses altogether.
>> 
>> These mechanisms have shortcomings for VMMs that perform device
>> emulation in separate processes (usually for increased security):
>> 
>> 1. Only one process performs ioctl(KVM_RUN) for a vCPU, so that
>>  mechanism is not available to device emulation processes.
>> 
>> 2. ioeventfd does not store the value written.  This makes it unsuitable
>>  for NVMe Submission Queue Tail Doorbell registers because the value
>>  written is needed by the device emulation process, for example.
>>  ioeventfd also does not support read operations.
>> 
>> 3. Polling does not support computed read operations and only the latest
>>  value written is available to the device emulation process
>>  (intermediate values are overwritten if the guest performs multiple
>>  accesses).
>> 
>> Overview
>> --------
>> This proposal aims to address this gap through a wire protocol and a new
>> KVM API for registering MMIO/PIO regions that use this alternative
>> dispatch mechanism.
>> 
>> The KVM API is used by the VMM to set up dispatch.  The wire protocol is
>> used to dispatch accesses from KVM to the device emulation process.
>> 
>> This new MMIO/PIO dispatch mechanism eliminates the need to return from
>> ioctl(KVM_RUN) in the VMM and then exchange messages with a device
>> emulation process.
>> 
>> Inefficient dispatch to device processes today:
>> 
>>  kvm.ko  <---ioctl(KVM_RUN)---> VMM <---messages---> device
>> 
>> Direct dispatch with the new mechanism:
>> 
>>  kvm.ko  <---ioctl(KVM_RUN)---> VMM
>>    ^
>>    `---new MMIO/PIO mechanism-> device
>> 
>> Even single-process VMMs can take advantage of the new mechanism.  For
>> example, QEMU's emulated NVMe storage controller can implement IOThread
>> support.
>> 
>> No constraint is placed on the device process architecture.  A single
>> process could emulate all devices belonging to the guest, each device
>> could be its own process, or something in between.
>> 
>> Both ioeventfd and traditional KVM_EXIT_MMIO/KVM_EXIT_IO emulation
>> continue to work alongside the new mechanism, but only one of them is
>> used for any given guest address.
>> 
>> KVM API
>> -------
>> The following new KVM ioctl is added:
>> 
>> KVM_SET_IOREGIONFD
>> Capability: KVM_CAP_IOREGIONFD
>> Architectures: all
>> Type: vm ioctl
>> Parameters: struct kvm_ioregionfd (in)
>> Returns: 0 on success, !0 on error
>> 
>> This ioctl adds, modifies, or removes MMIO or PIO regions where guest
>> accesses are dispatched through a given file descriptor instead of
>> returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO.
>> 
>> struct kvm_ioregionfd {
>>   __u64 guest_physical_addr;
>>   __u64 memory_size; /* bytes */
>>   __s32 fd;
>>   __u32 region_id;
>>   __u32 flags;
>>   __u8  pad[36];
>> };
>> 
>> /* for kvm_ioregionfd::flags */
>> #define KVM_IOREGIONFD_PIO           (1u << 0)
>> #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)
>> 
>> Regions are deleted by passing zero for memory_size.
>> 
>> MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.
>> 
>> The region_id is an opaque token that is included as part of the write
>> to the file descriptor.  It is typically a unique identifier for this
>> region but KVM does not interpret its value.
>> 
>> Both read and write guest accesses wait until an acknowledgement is
>> received on the file descriptor.  The KVM_IOREGIONFD_POSTED_WRITES flag
>> skips waiting for an acknowledgement on write accesses.  This is
>> suitable for accesses that do not require synchronous emulation, such as
>> doorbell register writes.
>> 
>> Wire protocol
>> -------------
>> The protocol spoken over the file descriptor is as follows.  The device
>> reads commands from the file descriptor with the following layout:
>> 
>> struct ioregionfd_cmd {
>>   __u32 info;
>>   __u32 region_id;
>>   __u64 addr;
>>   __u64 data;
>>   __u8 pad[8];
>> };
>> 
>> /* for ioregionfd_cmd::info */
>> #define IOREGIONFD_CMD_MASK 0xf
>> # define IOREGIONFD_CMD_READ 0
>> # define IOREGIONFD_CMD_WRITE 1
> 
> Why do we need 4 bits for this? I appreciate you want to align the
> next field, but there's SIZE_SHIFT for that; you could have CMD_MASK
> set to 0x1 unless I'm missing something. The reserved space could be
> used for something else in the future.
> 
>> #define IOREGIONFD_SIZE_MASK 0x30
>> #define IOREGIONFD_SIZE_SHIFT 4
>> # define IOREGIONFD_SIZE_8BIT 0
>> # define IOREGIONFD_SIZE_16BIT 1
>> # define IOREGIONFD_SIZE_32BIT 2
>> # define IOREGIONFD_SIZE_64BIT 3
> 
> Christophe already asked about the 64-bit limit. I think that's fine,
> and am assuming that if larger accesses are ever needed they can just
> be split in two commands by KVM?
> 
>> #define IOREGIONFD_NEED_PIO (1u << 6)
>> #define IOREGIONFD_NEED_RESPONSE (1u << 7)
>> 
>> The command is interpreted by inspecting the info field:
>> 
>> switch (cmd.info & IOREGIONFD_CMD_MASK) {
>> case IOREGIONFD_CMD_READ:
>>     /* It's a read access */
>>     break;
>> case IOREGIONFD_CMD_WRITE:
>>     /* It's a write access */
>>     break;
>> default:
>>     /* Protocol violation, terminate connection */
>> }
>> 
>> The access size is interpreted by inspecting the info field:
>> 
>> unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
>> /* where nbytes = pow(2, size) */
>> 
>> The region_id indicates which MMIO/PIO region is being accessed.  This
>> field has no inherent structure but is typically a unique identifier.
>> 
>> The byte offset being accessed within that region is addr.
> 
> It's not clear to me if addr is GPA absolute or an offset. Sounds like
> the latter, in which case isn't it preferable to name this "offset"?
> 
>> 
>> If the command is IOREGIONFD_CMD_WRITE then data contains the value
>> being written.
>> 
>> MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
>> accesses.
>> 
>> When IOREGIONFD_NEED_RESPONSE is set on an IOREGIONFD_CMD_WRITE command,
>> no response must be sent.  This flag has no effect for
>> IOREGIONFD_CMD_READ commands.
> 
> Christophe already flagged this, too. :)
> 
> That's all I had for now.
> 
> F.
> 
>> 
>> The device sends responses by writing the following structure to the
>> file descriptor:
>> 
>> struct ioregionfd_resp {
>>   __u64 data;
>>   __u32 info;
>>   __u8 pad[20];
>> };
>> 
>> /* for ioregionfd_resp::info */
>> #define IOREGIONFD_RESP_FAILED (1u << 0)
>> 
>> The info field is zero on success.  The IOREGIONFD_RESP_FAILED flag is
>> set on failure.
>> 
>> The data field contains the value read by an IOREGIONFD_CMD_READ
>> command.  This field is zero for other commands.
>> 
>> Does it support polling?
>> ------------------------
>> Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
>> file descriptor.  Poll the io_uring cq ring to detect when the read has
>> completed.
>> 
>> Although this dispatch mechanism incurs more overhead than polling
>> directly on guest RAM, it overcomes the limitations of polling: it
>> supports read accesses as well as capturing written values instead of
>> overwriting them.
>> 
>> Does it obsolete ioeventfd?
>> ---------------------------
>> No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
>> functionality to ioeventfd, there are differences.  The datamatch
>> functionality of ioeventfd is not available and would need to be
>> implemented by the device emulation program.  Due to the counter
>> semantics of eventfds there is automatic coalescing of repeated accesses
>> with ioeventfd.  Overall ioeventfd is lighter weight but also more
>> limited.
>> 
>> How does it scale?
>> ------------------
>> The protocol is synchronous - only one command/response cycle is in
>> flight at a time.  The vCPU will be blocked until the response has been
>> processed anyway.  If another vCPU accesses an MMIO or PIO region with
>> the same file descriptor during this time then it will wait too.

What happens if a vCPU issues an MMIO read access and the kernel task
is blocked reading from the fd, but the userspace counterpart does not respond?

Would the vCPU still respond to SIGIPIs if blocked?

What implications do you see or thoughts do you have for live migration?

Cheers,
Felipe

>> 
>> In practice this is not a problem since per-queue file descriptors can
>> be set up for multi-queue devices.
>> 
>> It is up to the device emulation program whether to handle multiple
>> devices over the same file descriptor or not.
>> 
>> What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
>> ----------------------------------------------------------------------
>> Any file descriptor that supports bidirectional I/O would do.  This
>> rules out eventfds and pipes.  socketpair(AF_UNIX) is a likely
>> candidate.  Maybe a char device will be necessary for improved
>> performance.
>> 
>> Can this be part of KVM_SET_USER_MEMORY_REGION?
>> -----------------------------------------------
>> Maybe.  Perhaps everything can be squeezed into struct
>> kvm_userspace_memory_region but it's only worth doing if the memory
>> region code needs to be reused for this in the first place.  I'm not
>> sure.
>> 
>> What do you think?
>> ------------------
>> I hope this serves as a starting point for improved MMIO/PIO dispatch in
>> KVM.  There are no immediate plans to implement this but I think it will
>> become necessary within the next year or two.
>> 
>> 1. Does it meet your requirements?
>> 2. Are there better alternatives?
>> 
>> Thanks,
>> Stefan
> 


* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-25 11:57     ` Michael S. Tsirkin
@ 2020-02-25 16:36       ` Stefan Hajnoczi
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-02-25 16:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, jasowang, cohuck, slp, felipe, john.g.johnson,
	robert.bradford, Dan Horobeanu, Stephen Barber, Peter Shier

[-- Attachment #1: Type: text/plain, Size: 2433 bytes --]

On Tue, Feb 25, 2020 at 06:57:18AM -0500, Michael S. Tsirkin wrote:
> On Tue, Feb 25, 2020 at 09:24:34AM +0000, Stefan Hajnoczi wrote:
> > On Mon, Feb 24, 2020 at 12:10:25PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Feb 22, 2020 at 08:19:16PM +0000, Stefan Hajnoczi wrote:
> > > > The KVM_IOREGIONFD_POSTED_WRITES flag
> > > > skips waiting for an acknowledgement on write accesses.  This is
> > > > suitable for accesses that do not require synchronous emulation, such as
> > > > doorbell register writes.
> > > 
> > > I would avoid hacks like this until we understand this better.
> > > Specifically one needs to be very careful since memory ordering semantics
> > > can differ between a write into an uncacheable range and host writes into
> > > a data structure. Reads from one region are also assumed to be ordered with
> > > writes to another region, and drivers are known to make assumptions
> > > like this.
> > > 
> > > Memory ordering being what it is, this isn't an area where I'd be
> > > comfortable assuming device writers know what they are doing.
> > 
> > Unlike PCI Posted Writes the idea is not to let the write operations sit
> > in a cache.  They will be sent immediately just like ioeventfd is
> > signalled immediately before re-entering the guest.
> 
> But with ioeventfd the write does sit in a cache: the internal counter.
> The fact that it's signalled does not force a barrier on the signalling
> thread.  It looks like the same happens here: the value is saved with
> the file descriptor, and other accesses to the same device can bypass
> the write.

See below.

> > The purpose of this feature is to let the device emulation program
> > handle these writes asynchronously (without holding up the vCPU for a
> > response from the device emulation program) but the order of
> > reads/writes remains unchanged.
> > 
> > Stefan
> 
> I don't see how this can be implemented without guest changes though.
> For example, how do you make sure two writes to such regions are ordered
> just like they are on PCI?

Register both regions with the same fd.  This way the write accesses
will be read out by the device emulation program in serialized order.

The purpose of the region_id field is to allow device emulation programs
that register multiple regions with the same fd to identify which one is
being accessed.

QEMU would only need 1 fd.  Each VFIO userspace device (muser) would
need 1 fd.
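
To make that concrete, here is a minimal sketch of what the VMM and a
device emulation program might do.  It assumes the proposed
KVM_SET_IOREGIONFD ioctl and the struct layouts quoted earlier in this
thread (none of which exist in a released kernel yet); the guest
physical addresses, region sizes, vm_fd and the handle_read()/
handle_write() helpers are made up for illustration:

/*
 * Sketch only: KVM_SET_IOREGIONFD, struct kvm_ioregionfd and the
 * ioregionfd_cmd/ioregionfd_resp definitions are the *proposed* ones
 * from this thread.  handle_read()/handle_write() are hypothetical
 * device-specific helpers.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

extern uint64_t handle_read(uint32_t region_id, uint64_t offset);
extern void handle_write(uint32_t region_id, uint64_t offset, uint64_t data);

static void setup_regions(int vm_fd, int sv[2])
{
    struct kvm_ioregionfd r = {0};

    socketpair(AF_UNIX, SOCK_STREAM, 0, sv); /* sv[0]: KVM side, sv[1]: device side */

    r.guest_physical_addr = 0xfe000000;      /* made-up BAR address, region 0 */
    r.memory_size = 0x1000;
    r.fd = sv[0];
    r.region_id = 0;
    ioctl(vm_fd, KVM_SET_IOREGIONFD, &r);

    r.guest_physical_addr = 0xfe001000;      /* second region on the *same* fd, */
    r.region_id = 1;                         /* so accesses to both stay ordered */
    ioctl(vm_fd, KVM_SET_IOREGIONFD, &r);
}

/* Device emulation side: one loop drains both regions in guest order. */
static void device_loop(int fd)
{
    struct ioregionfd_cmd cmd;
    struct ioregionfd_resp resp;

    while (read(fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
        memset(&resp, 0, sizeof(resp));

        switch (cmd.info & IOREGIONFD_CMD_MASK) {
        case IOREGIONFD_CMD_READ:
            resp.data = handle_read(cmd.region_id, cmd.addr);
            write(fd, &resp, sizeof(resp));
            break;
        case IOREGIONFD_CMD_WRITE:
            handle_write(cmd.region_id, cmd.addr, cmd.data);
            /* Assuming NEED_RESPONSE means a response is required, which
             * seems to be the intent despite the wording flagged above. */
            if (cmd.info & IOREGIONFD_NEED_RESPONSE)
                write(fd, &resp, sizeof(resp));
            break;
        default:
            return;                          /* protocol violation */
        }
    }
}

This is only meant to illustrate the region_id/same-fd point; a real
device would also check for errors and handle short reads on the stream
socket.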

Stefan


* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-25 12:19 ` Felipe Franciosi
  2020-02-25 12:30   ` Felipe Franciosi
@ 2020-02-25 16:45   ` Stefan Hajnoczi
  1 sibling, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-02-25 16:45 UTC (permalink / raw)
  To: Felipe Franciosi
  Cc: kvm, jasowang, Michael S. Tsirkin, Cornelia Huck, slp,
	John G Johnson, robert.bradford, Dan Horobeanu, Stephen Barber,
	Peter Shier, mlevitsk

[-- Attachment #1: Type: text/plain, Size: 2687 bytes --]

On Tue, Feb 25, 2020 at 12:19:55PM +0000, Felipe Franciosi wrote:
> Hi,
> 
> This looks amazing, Stefan. The lack of such a mechanism troubled us
> during the development of MUSER and resulted in the slow-path we have
> today for MMIO with register semantics (where written values must not be
> overwritten before the device emulator has a chance to process them).
> 
> I have added some comments inline, but wanted to first link your
> proposal with an idea that I discussed with Maxim Levitsky back in
> Lyon and evolve on it a little bit. IIRC/IIUC Maxim was keen on a VT-x
> extension where a CPU could IPI another to handle events which would
> normally cause a VMEXIT. That is probably more applicable to the
> standard ioeventfd model, but it got me thinking about PML.
> 
> Bear with me. :)
> 
> In addition to an fd, which could be used for notifications only, the
> wire protocol could append "struct ioregionfd_cmd"s (probably renamed
> to drop "fd") to one or more pages (perhaps a ring buffer of sorts).
> 
> That would only work for writes; reads would still be synchronous.
> 
> The device emulator therefore doesn't have to respond to each write
> command. It could process the whole lot whenever it gets around to it.

Yes, the design supports that as follows:

1. Set the KVM_IOREGIONFD_POSTED_WRITES flag on regions that require
   asynchronous writes.
2. Use io_uring IORING_OP_READ with N * sizeof(struct ioregionfd_cmd)
   where N is the maximum number of commands to be processed in a batch.
   Also make sure the socket sndbuf is at least this large if you're
   using a socket fd.
3. Poll on the io_uring cq ring from userspace.  No syscalls are
   required on the fd.  If io_uring sq polling is also enabled then no
   syscalls may be required at all.

If the fd ceases to be writeable because the device emulation program is
not reading struct ioregionfd_cmds quickly enough then the vCPU will be
blocked until the fd becomes writeable again.  But this shouldn't be an
issue for a poll mode device emulation program.
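
As a rough sketch of steps 2 and 3 above (using liburing; the queue
depth, batch size, ioregion_fd and handle_write() are placeholders, and
struct ioregionfd_cmd is the proposed layout, so this is not written
against any existing kernel ABI):

/*
 * Sketch only: polls posted writes out of an ioregionfd in batches.
 * handle_write() is a hypothetical device-specific helper.
 */
#include <errno.h>
#include <stdint.h>
#include <liburing.h>

#define BATCH 64

extern void handle_write(uint32_t region_id, uint64_t offset, uint64_t data);

static void poll_posted_writes(int ioregion_fd)
{
    struct ioregionfd_cmd cmds[BATCH];
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    io_uring_queue_init(8, &ring, 0);  /* add IORING_SETUP_SQPOLL to avoid submit syscalls */

    for (;;) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_read(sqe, ioregion_fd, cmds, sizeof(cmds), 0);
        io_uring_submit(&ring);

        /* Poll mode: spin on the cq ring instead of sleeping in the kernel. */
        while (io_uring_peek_cqe(&ring, &cqe) == -EAGAIN)
            ;                          /* real code would bound this and back off */

        int n = cqe->res;
        io_uring_cqe_seen(&ring, cqe);
        if (n <= 0)
            break;                     /* peer closed or error */

        /* Simplification: assumes only whole commands are returned.  A stream
         * socket can return a partial struct, so real code must handle that. */
        for (int i = 0; i < n / (int)sizeof(cmds[0]); i++)
            handle_write(cmds[i].region_id, cmds[i].addr, cmds[i].data);
    }

    io_uring_queue_exit(&ring);
}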

> Most importantly (and linking back to the VT-x extension idea), maybe
> we can avoid the VMEXIT altogether if the CPU could take care of
> appending writes to that buffer. Thoughts?

A VT-x extension would be nice.  It would probably need to be much
simpler in terms of memory region data structures and complexity though
(because of the hardware implementation of this feature).  For example,
just a single buffer for *all* MMIO/PIO accesses made by a vCPU.  That
becomes difficult to use efficiently when there are multiple device
emulation processes.  It's an interesting idea though.

Stefan


* Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)
  2020-02-25 12:30   ` Felipe Franciosi
@ 2020-02-25 16:47     ` Stefan Hajnoczi
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-02-25 16:47 UTC (permalink / raw)
  To: Felipe Franciosi
  Cc: kvm, jasowang, Michael S. Tsirkin, Cornelia Huck, slp,
	John G Johnson, robert.bradford, Dan Horobeanu, Stephen Barber,
	Peter Shier, mlevitsk

[-- Attachment #1: Type: text/plain, Size: 913 bytes --]

On Tue, Feb 25, 2020 at 12:30:16PM +0000, Felipe Franciosi wrote:
> > On Feb 25, 2020, at 12:19 PM, Felipe Franciosi <felipe@nutanix.com> wrote:
> >> /* for ioregionfd_cmd::info */
> >> #define IOREGIONFD_CMD_MASK 0xf
> >> # define IOREGIONFD_CMD_READ 0
> >> # define IOREGIONFD_CMD_WRITE 1
> > 
> > Why do we need 4 bits for this? I appreciate you want to align the
> > next field, but there's SIZE_SHIFT for that; you could have CMD_MASK
> > set to 0x1 unless I'm missing something. The reserved space could be
> > used for something else in the future.

Yes, it's reserved for future commands.

> >> The byte offset being accessed within that region is addr.
> > 
> > It's not clear to me if addr is GPA absolute or an offset. Sounds like
> > the latter, in which case isn't it preferable to name this "offset"?

Yes, it's an offset into the region.  "Offset" is clearer, thanks!

Stefan


end of thread, other threads:[~2020-02-25 16:47 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2020-02-22 20:19 Proposal for MMIO/PIO dispatch file descriptors (ioregionfd) Stefan Hajnoczi
2020-02-24 16:14 ` Christophe de Dinechin
2020-02-24 17:43   ` Stefan Hajnoczi
2020-02-24 17:10 ` Michael S. Tsirkin
2020-02-25  9:24   ` Stefan Hajnoczi
2020-02-25 11:57     ` Michael S. Tsirkin
2020-02-25 16:36       ` Stefan Hajnoczi
2020-02-25 12:19 ` Felipe Franciosi
2020-02-25 12:30   ` Felipe Franciosi
2020-02-25 16:47     ` Stefan Hajnoczi
2020-02-25 16:45   ` Stefan Hajnoczi
