* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
       [not found] <CAFO2pHzmVf7g3z0RikQbYnejwcWRtHKV=npALs49eRDJdt4mJQ@mail.gmail.com>
@ 2020-11-26  3:37 ` Jason Wang
  2020-11-26 12:36   ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2020-11-26  3:37 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: mst, john.g.johnson, dinechin, cohuck, felipe, Stefan Hajnoczi,
	Elena Ufimtseva, Jag Raman


On 2020/11/26 3:21 AM, Elena Afanasova wrote:
> Hello,
>
> I'm an Outreachy intern with QEMU and I’m working on implementing the 
> ioregionfd API in KVM.
> So I’d like to resume the ioregionfd design discussion. The latest 
> version of the ioregionfd API document is provided below.
>
> Overview
> --------
> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses 
> over a
> file descriptor without returning from ioctl(KVM_RUN). This allows device
> emulation to run in another task separate from the vCPU task.
>
> This is achieved through KVM ioctls for registering MMIO/PIO regions 
> and a wire
> protocol that KVM uses to communicate with a task handling an MMIO/PIO 
> access.
>
> The traditional ioctl(KVM_RUN) dispatch mechanism with device 
> emulation in a
> separate task looks like this:
>
>    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> 
> device task
>
> ioregionfd improves performance by eliminating the need for the vCPU 
> task to
> forward MMIO/PIO exits to device emulation tasks:


I wonder in which cases we care about performance like this. (Note that 
vhost-user has supported set|get_config() for a while.)


>
>    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task
>      ^
>      `---ioregionfd---> device task


It would be better to also draw the device task in the KVM_RUN-based 
diagram, to show the possible advantage.


>
> Both multi-threaded and multi-process VMMs can take advantage of 
> ioregionfd to
> run device emulation in dedicated threads and processes, respectively.
>
> This mechanism is similar to ioeventfd except it supports all read and 
> write
> accesses, whereas ioeventfd only supports posted doorbell writes.
>
> Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work 
> alongside
> the new mechanism, but only one mechanism handles an MMIO/PIO access.
>
> KVM_CREATE_IOREGIONFD
> ---------------------
> :Capability: KVM_CAP_IOREGIONFD
> :Architectures: all
> :Type: system ioctl
> :Parameters: none
> :Returns: an ioregionfd file descriptor, -1 on error
>
> This ioctl creates a new ioregionfd and returns the file descriptor. 
> The fd can
> be used to handle MMIO/PIO accesses instead of returning from 
> ioctl(KVM_RUN)
> with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions 
> must be
> registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses 
> on the
> fd. An ioregionfd can be used with multiple VMs and its lifecycle is 
> not tied
> to a specific VM.
>
> When the last file descriptor for an ioregionfd is closed, all regions
> registered with KVM_SET_IOREGION are dropped and guest accesses to those
> regions cause ioctl(KVM_RUN) to return again.


I may be missing something, but I don't see any special requirement for 
this fd. The fd is just a transport for a protocol between KVM and a 
userspace process. So instead of mandating a new type, it might be better 
to allow any type of fd to be attached (e.g. a pipe or socket).


>
> KVM_SET_IOREGION
> ----------------
> :Capability: KVM_CAP_IOREGIONFD
> :Architectures: all
> :Type: vm ioctl
> :Parameters: struct kvm_ioregion (in)
> :Returns: 0 on success, -1 on error
>
> This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO 
> region. Guest
> read and write accesses are dispatched through the given ioregionfd 
> instead of
> returning from ioctl(KVM_RUN).
>
> ::
>
>   struct kvm_ioregion {
>       __u64 guest_paddr; /* guest physical address */
>       __u64 memory_size; /* bytes */
>       __u64 user_data;
>       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
>       __u32 flags;
>       __u8  pad[32];
>   };
>
>   /* for kvm_ioregion::flags */
>   #define KVM_IOREGION_PIO           (1u << 0)
>   #define KVM_IOREGION_POSTED_WRITES (1u << 1)
>
> If a new region would split an existing region -1 is returned and errno is
> EINVAL.
>
> Regions can be deleted by setting fd to -1. If no existing region matches
> guest_paddr and memory_size then -1 is returned and errno is ENOENT.
>
> Existing regions can be modified as long as guest_paddr and memory_size
> match an existing region.
>
> MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead.
>
> The user_data value is included in messages KVM writes to the 
> ioregionfd upon
> guest access. KVM does not interpret user_data.
>
> Both read and write guest accesses wait for a response before entering the
> guest again. With the KVM_IOREGION_POSTED_WRITES flag, writes do not wait
> for a response and the guest is entered again immediately. This is suitable
> for accesses that do not require synchronous emulation, such as posted
> doorbell register writes.
> Note that guest writes may block the vCPU despite 
> KVM_IOREGION_POSTED_WRITES if
> the device is too slow in reading from the ioregionfd.
>
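
For concreteness, here is a minimal sketch of the VMM-side setup described
by the two ioctls above. The struct layout and flag values are copied from
this proposal; the ioctl request numbers are placeholders (the proposal is
not in mainline <linux/kvm.h>), so this is illustrative rather than a
working API:

    /* Illustrative only: KVM_CREATE_IOREGIONFD/KVM_SET_IOREGION are the
     * proposed (not yet merged) ioctls, so their request numbers below are
     * placeholders; the struct mirrors the definition in this document. */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    struct kvm_ioregion {
        uint64_t guest_paddr;   /* guest physical address */
        uint64_t memory_size;   /* bytes */
        uint64_t user_data;     /* opaque, echoed back in each command */
        int32_t  fd;            /* created with KVM_CREATE_IOREGIONFD */
        uint32_t flags;
        uint8_t  pad[32];
    };

    #define KVM_IOREGION_PIO           (1u << 0)
    #define KVM_IOREGION_POSTED_WRITES (1u << 1)

    /* Placeholder request numbers (KVMIO is 0xAE). */
    #define KVM_CREATE_IOREGIONFD _IO(0xAE, 0xf0)
    #define KVM_SET_IOREGION      _IOW(0xAE, 0xf1, struct kvm_ioregion)

    /* Register a 4 KiB MMIO region on a freshly created ioregionfd. */
    static int setup_ioregion(int kvm_fd, int vm_fd, uint64_t gpa)
    {
        int iofd = ioctl(kvm_fd, KVM_CREATE_IOREGIONFD, 0);
        if (iofd < 0) {
            perror("KVM_CREATE_IOREGIONFD");
            return -1;
        }

        struct kvm_ioregion region = {
            .guest_paddr = gpa,
            .memory_size = 0x1000,
            .user_data   = 0x1234,   /* lets the device identify the region */
            .fd          = iofd,
            .flags       = 0,        /* MMIO, synchronous reads and writes */
        };
        if (ioctl(vm_fd, KVM_SET_IOREGION, &region) < 0) {
            perror("KVM_SET_IOREGION");
            return -1;
        }
        return iofd;                 /* hand this fd to the device task */
    }

Per the description above, deleting the region later would use the same
ioctl with fd set to -1.
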
> Wire protocol
> -------------
> The protocol spoken over the file descriptor is as follows. The device 
> reads
> commands from the file descriptor with the following layout::
>
>   struct ioregionfd_cmd {
>       __u32 info;
>       __u32 padding;
>       __u64 user_data;
>       __u64 offset;
>       __u64 data;
>   };
>
> The info field layout is as follows::
>
>   bits:  | 31 ... 7 |  6   | 5 ... 4 | 3 ... 0 |
>   field: | reserved | resp |   size  |   cmd   |
>
> The cmd field identifies the operation to perform::
>
>   #define IOREGIONFD_CMD_READ  0
>   #define IOREGIONFD_CMD_WRITE 1
>
> The size field indicates the size of the access::
>
>   #define IOREGIONFD_SIZE_8BIT  0
>   #define IOREGIONFD_SIZE_16BIT 1
>   #define IOREGIONFD_SIZE_32BIT 2
>   #define IOREGIONFD_SIZE_64BIT 3
>
> If the command is IOREGIONFD_CMD_WRITE then the resp bit indicates 
> whether or
> not a response must be sent.
>
> The user_data field contains the opaque value provided to 
> KVM_SET_IOREGION.
> Applications can use this to uniquely identify the region that is being
> accessed.
>
> The offset field contains the byte offset being accessed within a region
> that was registered with KVM_SET_IOREGION.
>
> If the command is IOREGIONFD_CMD_WRITE then data contains the value
> being written. The data value is a 64-bit integer in host endianness,
> regardless of the access size.
>
> The device sends responses by writing the following structure to the
> file descriptor::
>
>   struct ioregionfd_resp {
>       __u64 data;
>       __u8 pad[24];
>   };
>
> The data field contains the value read by an IOREGIONFD_CMD_READ
> command. This field is zero for other commands. The data value is a 64-bit
> integer in host endianness, regardless of the access size.
>
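
To make the command/response cycle concrete, a sketch of the device-side
handling is shown below. The structs and bit positions come straight from
the protocol description above; the toy register file and the assumption
that a single read()/write() transfers a whole structure are simplifications
for illustration:

    #include <stdint.h>
    #include <unistd.h>

    struct ioregionfd_cmd {
        uint32_t info;
        uint32_t padding;
        uint64_t user_data;
        uint64_t offset;
        uint64_t data;
    };

    struct ioregionfd_resp {
        uint64_t data;
        uint8_t  pad[24];
    };

    #define IOREGIONFD_CMD_READ  0
    #define IOREGIONFD_CMD_WRITE 1

    /* info layout per the proposal: cmd in bits 3..0, size in 5..4, resp in 6 */
    #define CMD_OF(info)  ((info) & 0xf)
    #define SIZE_OF(info) (((info) >> 4) & 0x3)
    #define RESP_OF(info) (((info) >> 6) & 0x1)

    /* Service one guest access; regs[] is a toy 64-bit register file.
     * A real device would also honor SIZE_OF() for sub-word accesses. */
    static int handle_one_access(int iofd, uint64_t *regs)
    {
        struct ioregionfd_cmd cmd;
        struct ioregionfd_resp resp = { 0 };

        if (read(iofd, &cmd, sizeof(cmd)) != sizeof(cmd))
            return -1;

        uint64_t *reg = &regs[cmd.offset / 8];

        if (CMD_OF(cmd.info) == IOREGIONFD_CMD_WRITE) {
            *reg = cmd.data;             /* host-endian value, per the spec */
            if (!RESP_OF(cmd.info))
                return 0;                /* posted write: no response expected */
        } else {
            resp.data = *reg;            /* IOREGIONFD_CMD_READ */
        }
        return write(iofd, &resp, sizeof(resp)) == sizeof(resp) ? 0 : -1;
    }
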
> Ordering
> --------
> Guest accesses are delivered in order, including posted writes.
>
> Signals
> -------
> The vCPU task can be interrupted by a signal while waiting for an 
> ioregionfd
> response. In this case ioctl(KVM_RUN) returns with -EINTR. Guest entry is
> deferred until ioctl(KVM_RUN) is called again and the response has 
> been written
> to the ioregionfd.
>
> Security
> --------
> Device emulation processes may be untrusted in multi-process VMM 
> architectures.
> Therefore the control plane and the data plane of ioregionfd are 
> separate. A
> task that only has access to an ioregionfd is unable to add/modify/remove
> regions since that requires ioctls on a KVM vm fd. This ensures that 
> device
> emulation processes can only service MMIO/PIO accesses for regions 
> that the VMM
> registered on their behalf.
>
> Multi-queue scalability
> -----------------------
> The protocol is synchronous - only one command/response cycle is in 
> flight at a
> time - but the vCPU will be blocked until the response has been processed
> anyway. If another vCPU accesses an MMIO or PIO region belonging to 
> the same
> ioregionfd during this time then it waits for the first access to 
> complete.
>
> Per-queue ioregionfds can be set up to take advantage of concurrency on
> multi-queue devices.
>
> Polling
> -------
> Userspace can poll ioregionfd by submitting an io_uring IORING_OP_READ 
> request
> and polling the cq ring to detect when the read has completed. 
> Although this
> dispatch mechanism incurs more overhead than polling directly on guest 
> RAM, it
> captures each write access and supports reads.
>
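
For the io_uring idea above, a rough sketch with liburing might look like
the following; the busy-poll loop, the single outstanding read, and the
omitted error handling are simplifications, and the dispatch step is left
as a hypothetical handle_command() call:

    #include <errno.h>
    #include <stddef.h>
    #include <liburing.h>

    /* cmd_buf/cmd_len stand for a struct ioregionfd_cmd as defined in the
     * wire protocol above. */
    static void poll_ioregionfd(int iofd, void *cmd_buf, size_t cmd_len)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);       /* error handling omitted */

        for (;;) {
            /* keep one IORING_OP_READ in flight on the ioregionfd */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, iofd, cmd_buf, cmd_len, 0);
            io_uring_submit(&ring);

            /* poll the cq ring instead of blocking in read(2) */
            struct io_uring_cqe *cqe;
            while (io_uring_peek_cqe(&ring, &cqe) == -EAGAIN)
                ;

            if (cqe->res == (int)cmd_len) {
                /* handle_command(iofd, cmd_buf);  -- hypothetical dispatch */
            }
            io_uring_cqe_seen(&ring, cqe);
        }
    }
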
> Does it obsolete ioeventfd?
> ---------------------------
> No, although KVM_IOREGION_POSTED_WRITES offers somewhat similar 
> functionality
> to ioeventfd, there are differences. The datamatch functionality of 
> ioeventfd
> is not available and would need to be implemented by the device emulation
> program.


This means another dispatching layer in the device emulation.

Thanks


> Due to the counter semantics of eventfds there is automatic coalescing
> of repeated accesses with ioeventfd. Overall ioeventfd is lighter 
> weight but
> also more limited.


* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-26  3:37 ` MMIO/PIO dispatch file descriptors (ioregionfd) design discussion Jason Wang
@ 2020-11-26 12:36   ` Stefan Hajnoczi
  2020-11-27  3:39     ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-11-26 12:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman

On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
> On 2020/11/26 上午3:21, Elena Afanasova wrote:
> > Hello,
> > 
> > I'm an Outreachy intern with QEMU and I’m working on implementing the
> > ioregionfd API in KVM.
> > So I’d like to resume the ioregionfd design discussion. The latest
> > version of the ioregionfd API document is provided below.
> > 
> > Overview
> > --------
> > ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
> > over a
> > file descriptor without returning from ioctl(KVM_RUN). This allows device
> > emulation to run in another task separate from the vCPU task.
> > 
> > This is achieved through KVM ioctls for registering MMIO/PIO regions and
> > a wire
> > protocol that KVM uses to communicate with a task handling an MMIO/PIO
> > access.
> > 
> > The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation
> > in a
> > separate task looks like this:
> > 
> >    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device
> > task
> > 
> > ioregionfd improves performance by eliminating the need for the vCPU
> > task to
> > forward MMIO/PIO exits to device emulation tasks:
> 
> 
> I wonder at which cases we care performance like this. (Note that vhost-user
> suppots set|get_config() for a while).

NVMe emulation needs this because ioeventfd cannot transfer the value
written to the doorbell. That's why QEMU's NVMe emulation doesn't
support IOThreads.

> > KVM_CREATE_IOREGIONFD
> > ---------------------
> > :Capability: KVM_CAP_IOREGIONFD
> > :Architectures: all
> > :Type: system ioctl
> > :Parameters: none
> > :Returns: an ioregionfd file descriptor, -1 on error
> > 
> > This ioctl creates a new ioregionfd and returns the file descriptor. The
> > fd can
> > be used to handle MMIO/PIO accesses instead of returning from
> > ioctl(KVM_RUN)
> > with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must
> > be
> > registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses
> > on the
> > fd. An ioregionfd can be used with multiple VMs and its lifecycle is not
> > tied
> > to a specific VM.
> > 
> > When the last file descriptor for an ioregionfd is closed, all regions
> > registered with KVM_SET_IOREGION are dropped and guest accesses to those
> > regions cause ioctl(KVM_RUN) to return again.
> 
> 
> I may miss something, but I don't see any special requirement of this fd.
> The fd just a transport of a protocol between KVM and userspace process. So
> instead of mandating a new type, it might be better to allow any type of fd
> to be attached. (E.g pipe or socket).

pipe(2) is unidirectional on Linux, so it won't work.

mkfifo(3) seems usable but creates a node on a filesystem.

socketpair(2) would work, but brings in the network stack when it's not
needed. The advantage is that some future use case might want to direct
ioregionfd over a real socket to a remote host, which would be cool.

Do you have an idea of the performance difference of socketpair(2)
compared to a custom fd?

If it's negligible then using an arbitrary socket is more flexible and
sounds good.

Stefan

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-26 12:36   ` Stefan Hajnoczi
@ 2020-11-27  3:39     ` Jason Wang
  2020-11-27 13:44       ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2020-11-27  3:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman


On 2020/11/26 8:36 PM, Stefan Hajnoczi wrote:
> On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
>> On 2020/11/26 上午3:21, Elena Afanasova wrote:
>>> Hello,
>>>
>>> I'm an Outreachy intern with QEMU and I’m working on implementing the
>>> ioregionfd API in KVM.
>>> So I’d like to resume the ioregionfd design discussion. The latest
>>> version of the ioregionfd API document is provided below.
>>>
>>> Overview
>>> --------
>>> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
>>> over a
>>> file descriptor without returning from ioctl(KVM_RUN). This allows device
>>> emulation to run in another task separate from the vCPU task.
>>>
>>> This is achieved through KVM ioctls for registering MMIO/PIO regions and
>>> a wire
>>> protocol that KVM uses to communicate with a task handling an MMIO/PIO
>>> access.
>>>
>>> The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation
>>> in a
>>> separate task looks like this:
>>>
>>>     kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device
>>> task
>>>
>>> ioregionfd improves performance by eliminating the need for the vCPU
>>> task to
>>> forward MMIO/PIO exits to device emulation tasks:
>>
>> I wonder at which cases we care performance like this. (Note that vhost-user
>> suppots set|get_config() for a while).
> NVMe emulation needs this because ioeventfd cannot transfer the value
> written to the doorbell. That's why QEMU's NVMe emulation doesn't
> support IOThreads.


I think it depends on how many different values can be carried via the 
doorbell. If there aren't tons of them, we can use datamatch. Anyway, 
virtio supports distinguishing the queue index via the value written to 
the doorbell.


>
>>> KVM_CREATE_IOREGIONFD
>>> ---------------------
>>> :Capability: KVM_CAP_IOREGIONFD
>>> :Architectures: all
>>> :Type: system ioctl
>>> :Parameters: none
>>> :Returns: an ioregionfd file descriptor, -1 on error
>>>
>>> This ioctl creates a new ioregionfd and returns the file descriptor. The
>>> fd can
>>> be used to handle MMIO/PIO accesses instead of returning from
>>> ioctl(KVM_RUN)
>>> with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must
>>> be
>>> registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses
>>> on the
>>> fd. An ioregionfd can be used with multiple VMs and its lifecycle is not
>>> tied
>>> to a specific VM.
>>>
>>> When the last file descriptor for an ioregionfd is closed, all regions
>>> registered with KVM_SET_IOREGION are dropped and guest accesses to those
>>> regions cause ioctl(KVM_RUN) to return again.
>>
>> I may miss something, but I don't see any special requirement of this fd.
>> The fd just a transport of a protocol between KVM and userspace process. So
>> instead of mandating a new type, it might be better to allow any type of fd
>> to be attached. (E.g pipe or socket).
> pipe(2) is unidirectional on Linux, so it won't work.


Can we accept two file descriptors to make it work?


>
> mkfifo(3) seems usable but creates a node on a filesystem.
>
> socketpair(2) would work, but brings in the network stack when it's not
> needed. The advantage is that some future user case might want to direct
> ioregionfd over a real socket to a remote host, which would be cool.
>
> Do you have an idea of the performance difference of socketpair(2)
> compared to a custom fd?


It should be slower than a custom fd, and a UNIX socket should be faster 
than TIPC. Maybe we can have a custom fd, but it's better to leave the 
policy to userspace:

1) KVM should not place any limitation on the fd it uses; the user takes 
the risk if the fd is used wrongly, and a custom fd should be one of 
the choices
2) it's better not to have a virt-specific name (e.g. "KVM" or "ioregion")

Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO 
and allow it to decide how to proceed?

Thanks


>
> If it's neglible then using an arbitrary socket is more flexible and
> sounds good.
>
> Stefan


* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-27  3:39     ` Jason Wang
@ 2020-11-27 13:44       ` Stefan Hajnoczi
  2020-11-30  2:14         ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-11-27 13:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman

On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
> 
> On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
> > On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
> > > On 2020/11/26 上午3:21, Elena Afanasova wrote:
> > > > Hello,
> > > > 
> > > > I'm an Outreachy intern with QEMU and I’m working on implementing the
> > > > ioregionfd API in KVM.
> > > > So I’d like to resume the ioregionfd design discussion. The latest
> > > > version of the ioregionfd API document is provided below.
> > > > 
> > > > Overview
> > > > --------
> > > > ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
> > > > over a
> > > > file descriptor without returning from ioctl(KVM_RUN). This allows device
> > > > emulation to run in another task separate from the vCPU task.
> > > > 
> > > > This is achieved through KVM ioctls for registering MMIO/PIO regions and
> > > > a wire
> > > > protocol that KVM uses to communicate with a task handling an MMIO/PIO
> > > > access.
> > > > 
> > > > The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation
> > > > in a
> > > > separate task looks like this:
> > > > 
> > > >     kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device
> > > > task
> > > > 
> > > > ioregionfd improves performance by eliminating the need for the vCPU
> > > > task to
> > > > forward MMIO/PIO exits to device emulation tasks:
> > > 
> > > I wonder at which cases we care performance like this. (Note that vhost-user
> > > suppots set|get_config() for a while).
> > NVMe emulation needs this because ioeventfd cannot transfer the value
> > written to the doorbell. That's why QEMU's NVMe emulation doesn't
> > support IOThreads.
> 
> 
> I think it depends on how many different value that can be carried via
> doorbell. If it's not tons of, we can use datamatch. Anyway virtio support
> differing queue index via the value wrote to doorbell.

There are too many values; it's not the queue index, it's the ring index
of the latest request. If the ring size is 128, we need 128 ioeventfd
registrations, etc. It becomes a lot.
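
For reference, with the existing KVM_IOEVENTFD uapi (which is mainline,
unlike ioregionfd) datamatch fires on one exact value only, so covering
every possible ring index written to a single doorbell needs one eventfd
and one registration per value, roughly as in this illustrative sketch:

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>

    /* One eventfd and one KVM_IOEVENTFD registration per possible doorbell
     * value: 128 of each for a 128-entry ring, versus a single ioregionfd
     * that simply carries whatever value the guest wrote. */
    static void register_doorbell_datamatch(int vm_fd, uint64_t doorbell_gpa,
                                            int efds[128])
    {
        for (uint64_t idx = 0; idx < 128; idx++) {
            efds[idx] = eventfd(0, EFD_NONBLOCK);

            struct kvm_ioeventfd ioev = {
                .datamatch = idx,        /* the ring index the guest writes */
                .addr      = doorbell_gpa,
                .len       = 4,
                .fd        = efds[idx],
                .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
            };
            ioctl(vm_fd, KVM_IOEVENTFD, &ioev);  /* error handling omitted */
        }
    }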

By the way, the long-term use case for ioregionfd is to allow vfio-user
device emulation processes to directly handle I/O accesses. Elena
benchmarked ioeventfd vs dispatching through QEMU and can share the
performance results. I think the number was around 30+% improvement via
direct ioeventfd dispatch, so it will be important for high IOPS
devices (network and storage controllers).

> > 
> > > > KVM_CREATE_IOREGIONFD
> > > > ---------------------
> > > > :Capability: KVM_CAP_IOREGIONFD
> > > > :Architectures: all
> > > > :Type: system ioctl
> > > > :Parameters: none
> > > > :Returns: an ioregionfd file descriptor, -1 on error
> > > > 
> > > > This ioctl creates a new ioregionfd and returns the file descriptor. The
> > > > fd can
> > > > be used to handle MMIO/PIO accesses instead of returning from
> > > > ioctl(KVM_RUN)
> > > > with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must
> > > > be
> > > > registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses
> > > > on the
> > > > fd. An ioregionfd can be used with multiple VMs and its lifecycle is not
> > > > tied
> > > > to a specific VM.
> > > > 
> > > > When the last file descriptor for an ioregionfd is closed, all regions
> > > > registered with KVM_SET_IOREGION are dropped and guest accesses to those
> > > > regions cause ioctl(KVM_RUN) to return again.
> > > 
> > > I may miss something, but I don't see any special requirement of this fd.
> > > The fd just a transport of a protocol between KVM and userspace process. So
> > > instead of mandating a new type, it might be better to allow any type of fd
> > > to be attached. (E.g pipe or socket).
> > pipe(2) is unidirectional on Linux, so it won't work.
> 
> 
> Can we accept two file descriptors to make it work?
> 
> 
> > 
> > mkfifo(3) seems usable but creates a node on a filesystem.
> > 
> > socketpair(2) would work, but brings in the network stack when it's not
> > needed. The advantage is that some future user case might want to direct
> > ioregionfd over a real socket to a remote host, which would be cool.
> > 
> > Do you have an idea of the performance difference of socketpair(2)
> > compared to a custom fd?
> 
> 
> It should be slower than custom fd and UNIX socket should be faster than
> TIPC. Maybe we can have a custom fd, but it's better to leave the policy to
> the userspace:
> 
> 1) KVM should not have any limitation of the fd it uses, user will risk
> itself if the fd has been used wrongly, and the custom fd should be one of
> the choice
> 2) it's better to not have a virt specific name (e.g "KVM" or "ioregion")

Okay, it looks like there are things to investigate here.

Elena: My suggestion would be to start with the simplest option -
letting userspace pass in 1 file descriptor. You can investigate the
performance of socketpair(2)/fifo(7), 2 pipe fds, or a custom file
implementation later if time permits. That way the API has maximum
flexibility (userspace can decide on the file type).
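
As a reference point for that investigation, the socketpair(2) variant
could look roughly like the sketch below. It assumes KVM_SET_IOREGION and
struct kvm_ioregion from the proposal earlier in this thread, relaxed to
accept an arbitrary fd, which is exactly the open question here:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>

    /* Sketch: give KVM one end of a socketpair as the "ioregionfd" and pass
     * the other end to the device emulation process (e.g. via SCM_RIGHTS).
     * KVM_SET_IOREGION and struct kvm_ioregion are the proposed API, not
     * mainline, so this is illustrative only. */
    static int attach_socketpair_region(int vm_fd, uint64_t gpa, uint64_t size)
    {
        int sv[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
            return -1;

        struct kvm_ioregion region = {
            .guest_paddr = gpa,
            .memory_size = size,
            .user_data   = gpa,      /* lets the device identify the region */
            .fd          = sv[0],    /* KVM's end */
            .flags       = 0,
        };
        if (ioctl(vm_fd, KVM_SET_IOREGION, &region) < 0)
            return -1;

        return sv[1];                /* device emulation's end */
    }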

> Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
> allow it to decide how to proceed?

The eBPF program approach is interesting, but it would probably require
access to guest RAM and additional userspace state (e.g. device-specific
register values). I don't know the current status of Linux eBPF - is it
possible to access user memory (it could be swapped out)?

Stefan

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-27 13:44       ` Stefan Hajnoczi
@ 2020-11-30  2:14         ` Jason Wang
  2020-11-30 12:47           ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2020-11-30  2:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman


On 2020/11/27 9:44 PM, Stefan Hajnoczi wrote:
> On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
>> On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
>>> On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
>>>> On 2020/11/26 上午3:21, Elena Afanasova wrote:
>>>>> Hello,
>>>>>
>>>>> I'm an Outreachy intern with QEMU and I’m working on implementing the
>>>>> ioregionfd API in KVM.
>>>>> So I’d like to resume the ioregionfd design discussion. The latest
>>>>> version of the ioregionfd API document is provided below.
>>>>>
>>>>> Overview
>>>>> --------
>>>>> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
>>>>> over a
>>>>> file descriptor without returning from ioctl(KVM_RUN). This allows device
>>>>> emulation to run in another task separate from the vCPU task.
>>>>>
>>>>> This is achieved through KVM ioctls for registering MMIO/PIO regions and
>>>>> a wire
>>>>> protocol that KVM uses to communicate with a task handling an MMIO/PIO
>>>>> access.
>>>>>
>>>>> The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation
>>>>> in a
>>>>> separate task looks like this:
>>>>>
>>>>>      kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device
>>>>> task
>>>>>
>>>>> ioregionfd improves performance by eliminating the need for the vCPU
>>>>> task to
>>>>> forward MMIO/PIO exits to device emulation tasks:
>>>> I wonder at which cases we care performance like this. (Note that vhost-user
>>>> suppots set|get_config() for a while).
>>> NVMe emulation needs this because ioeventfd cannot transfer the value
>>> written to the doorbell. That's why QEMU's NVMe emulation doesn't
>>> support IOThreads.
>>
>> I think it depends on how many different value that can be carried via
>> doorbell. If it's not tons of, we can use datamatch. Anyway virtio support
>> differing queue index via the value wrote to doorbell.
> There are too many value, it's not the queue index. It's the ring index
> of the latest request. If the ring size is 128, we need 128 ioeventfd
> registrations, etc. It becomes a lot.


I see, it's somewhat similar to the NOTIFICATION_DATA feature in virtio.


>
> By the way, the long-term use case for ioregionfd is to allow vfio-user
> device emulation processes to directly handle I/O accesses. Elena
> benchmarked ioeventfd vs dispatching through QEMU and can share the
> perform results. I think the number was around 30+% improvement via
> direct ioeventfd dispatch, so it will be important for high IOPS
> devices (network and storage controllers).


That's amazing :)


>
>>>>> KVM_CREATE_IOREGIONFD
>>>>> ---------------------
>>>>> :Capability: KVM_CAP_IOREGIONFD
>>>>> :Architectures: all
>>>>> :Type: system ioctl
>>>>> :Parameters: none
>>>>> :Returns: an ioregionfd file descriptor, -1 on error
>>>>>
>>>>> This ioctl creates a new ioregionfd and returns the file descriptor. The
>>>>> fd can
>>>>> be used to handle MMIO/PIO accesses instead of returning from
>>>>> ioctl(KVM_RUN)
>>>>> with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must
>>>>> be
>>>>> registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses
>>>>> on the
>>>>> fd. An ioregionfd can be used with multiple VMs and its lifecycle is not
>>>>> tied
>>>>> to a specific VM.
>>>>>
>>>>> When the last file descriptor for an ioregionfd is closed, all regions
>>>>> registered with KVM_SET_IOREGION are dropped and guest accesses to those
>>>>> regions cause ioctl(KVM_RUN) to return again.
>>>> I may miss something, but I don't see any special requirement of this fd.
>>>> The fd just a transport of a protocol between KVM and userspace process. So
>>>> instead of mandating a new type, it might be better to allow any type of fd
>>>> to be attached. (E.g pipe or socket).
>>> pipe(2) is unidirectional on Linux, so it won't work.
>>
>> Can we accept two file descriptors to make it work?
>>
>>
>>> mkfifo(3) seems usable but creates a node on a filesystem.
>>>
>>> socketpair(2) would work, but brings in the network stack when it's not
>>> needed. The advantage is that some future user case might want to direct
>>> ioregionfd over a real socket to a remote host, which would be cool.
>>>
>>> Do you have an idea of the performance difference of socketpair(2)
>>> compared to a custom fd?
>>
>> It should be slower than custom fd and UNIX socket should be faster than
>> TIPC. Maybe we can have a custom fd, but it's better to leave the policy to
>> the userspace:
>>
>> 1) KVM should not have any limitation of the fd it uses, user will risk
>> itself if the fd has been used wrongly, and the custom fd should be one of
>> the choice
>> 2) it's better to not have a virt specific name (e.g "KVM" or "ioregion")
> Okay, it looks like there are things to investigate here.
>
> Elena: My suggestion would be to start with the simplest option -
> letting userspace pass in 1 file descriptor. You can investigate the
> performance of socketpair(2)/fifo(7), 2 pipe fds, or a custom file
> implementation later if time permits. That way the API has maximum
> flexibility (userspace can decide on the file type).
>
>> Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
>> allow it to decide how to proceed?
> The eBPF program approach is interesting, but it would probably require
> access to guest RAM and additional userspace state (e.g. device-specific
> register values). I don't know the current status of Linux eBPF - is it
> possible to access user memory (it could be swapped out)?


AFAIK it doesn't, but just to make sure I understand, is there any reason 
that eBPF would need to access userspace memory here?

Thanks


>
> Stefan


* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-30  2:14         ` Jason Wang
@ 2020-11-30 12:47           ` Stefan Hajnoczi
  2020-12-01  4:05             ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-11-30 12:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman

On Mon, Nov 30, 2020 at 10:14:15AM +0800, Jason Wang wrote:
> On 2020/11/27 下午9:44, Stefan Hajnoczi wrote:
> > On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
> > > On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
> > > > On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
> > > > > On 2020/11/26 上午3:21, Elena Afanasova wrote:
> > > Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
> > > allow it to decide how to proceed?
> > The eBPF program approach is interesting, but it would probably require
> > access to guest RAM and additional userspace state (e.g. device-specific
> > register values). I don't know the current status of Linux eBPF - is it
> > possible to access user memory (it could be swapped out)?
> 
> 
> AFAIK it doesn't, but just to make sure I understand, any reason that eBPF
> need to access userspace memory here?

Maybe we're thinking of different things. In the past I've thought about
using eBPF to avoid a trip to userspace for request submission and
completion, but that requires virtqueue parsing from eBPF and guest RAM
access.

Are you thinking about replacing ioctl(KVM_SET_IOREGION) and all the
necessary kvm.ko code with an ioctl(KVM_SET_IO_PROGRAM), where userspace
can load an eBPF program into kvm.ko that gets executed when MMIO/PIO
accesses occur? Wouldn't it need to write to userspace memory to store
the ring index that was written to the doorbell register, for example?
How would the program communicate with userspace (eventfd isn't enough)
and how can it handle synchronous I/O accesses like reads?

Stefan

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-30 12:47           ` Stefan Hajnoczi
@ 2020-12-01  4:05             ` Jason Wang
  2020-12-01 10:35               ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2020-12-01  4:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman


On 2020/11/30 8:47 PM, Stefan Hajnoczi wrote:
> On Mon, Nov 30, 2020 at 10:14:15AM +0800, Jason Wang wrote:
>> On 2020/11/27 下午9:44, Stefan Hajnoczi wrote:
>>> On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
>>>> On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
>>>>> On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
>>>>>> On 2020/11/26 上午3:21, Elena Afanasova wrote:
>>>> Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
>>>> allow it to decide how to proceed?
>>> The eBPF program approach is interesting, but it would probably require
>>> access to guest RAM and additional userspace state (e.g. device-specific
>>> register values). I don't know the current status of Linux eBPF - is it
>>> possible to access user memory (it could be swapped out)?
>>
>> AFAIK it doesn't, but just to make sure I understand, any reason that eBPF
>> need to access userspace memory here?
> Maybe we're thinking of different things. In the past I've thought about
> using eBPF to avoid a trip to userspace for request submission and
> completion, but that requires virtqueue parsing from eBPF and guest RAM
> access.


I see. I've considered something similar, e.g. using an eBPF dataplane in 
vhost, but it requires a lot of work. For guest RAM access, we could 
probably provide some eBPF helpers to do that, but we would need a strong 
argument to convince the eBPF maintainers.


>
> Are you thinking about replacing ioctl(KVM_SET_IOREGION) and all the
> necessary kvm.ko code with an ioctl(KVM_SET_IO_PROGRAM), where userspace
> can load an eBPF program into kvm.ko that gets executed when an MMIO/PIO
> accesses occur?


Yes.


>   Wouldn't it need to write to userspace memory to store
> the ring index that was written to the doorbell register, for example?


The program itself can choose what to do:

1) do datamatch and write/wake up an eventfd

or

2) transport the write via an arbitrary fd, as has been done in this 
proposal, but with a userspace-defined protocol


> How would the program communicate with userspace (eventfd isn't enough)
> and how can it handle synchronous I/O accesses like reads?


I may be missing something, but couldn't it behave exactly as what has 
been proposed in this patch?

Thanks


>
> Stefan


* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-01  4:05             ` Jason Wang
@ 2020-12-01 10:35               ` Stefan Hajnoczi
  2020-12-02  2:53                 ` Jason Wang
  2020-12-02 14:17                 ` Elena Afanasova
  0 siblings, 2 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-12-01 10:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman

On Tue, Dec 01, 2020 at 12:05:04PM +0800, Jason Wang wrote:
> 
> On 2020/11/30 下午8:47, Stefan Hajnoczi wrote:
> > On Mon, Nov 30, 2020 at 10:14:15AM +0800, Jason Wang wrote:
> > > On 2020/11/27 下午9:44, Stefan Hajnoczi wrote:
> > > > On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
> > > > > On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
> > > > > > On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
> > > > > > > On 2020/11/26 上午3:21, Elena Afanasova wrote:
> > > > > Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
> > > > > allow it to decide how to proceed?
> > > > The eBPF program approach is interesting, but it would probably require
> > > > access to guest RAM and additional userspace state (e.g. device-specific
> > > > register values). I don't know the current status of Linux eBPF - is it
> > > > possible to access user memory (it could be swapped out)?
> > > 
> > > AFAIK it doesn't, but just to make sure I understand, any reason that eBPF
> > > need to access userspace memory here?
> > Maybe we're thinking of different things. In the past I've thought about
> > using eBPF to avoid a trip to userspace for request submission and
> > completion, but that requires virtqueue parsing from eBPF and guest RAM
> > access.
> 
> 
> I see. I've  considered something similar. e.g using eBPF dataplane in
> vhost, but it requires a lot of work. For guest RAM access, we probably can
> provide some eBPF helpers to do that but we need strong point to convince
> eBPF guys.
> 
> 
> > 
> > Are you thinking about replacing ioctl(KVM_SET_IOREGION) and all the
> > necessary kvm.ko code with an ioctl(KVM_SET_IO_PROGRAM), where userspace
> > can load an eBPF program into kvm.ko that gets executed when an MMIO/PIO
> > accesses occur?
> 
> 
> Yes.
> 
> 
> >   Wouldn't it need to write to userspace memory to store
> > the ring index that was written to the doorbell register, for example?
> 
> 
> The proram itself can choose want to do:
> 
> 1) do datamatch and write/wakeup eventfd
> 
> or
> 
> 2) transport the write via an arbitrary fd as what has been done in this
> proposal, but the protocol is userspace defined
>
> > How would the program communicate with userspace (eventfd isn't enough)
> > and how can it handle synchronous I/O accesses like reads?
> 
> 
> I may miss something, but it can behave exactly as what has been proposed in
> this patch?

I see. This seems to have two possible advantages:
1. Pushing the kvm.ko code into userspace thanks to eBPF. Less kernel
   code.
2. Allowing more flexible I/O dispatch logic (e.g. ioeventfd-style
   datamatch) and communication protocols.

I think #1 is minor because the communication protocol is trivial,
struct kvm_io_device can be reused for dispatch, and eBPF will introduce
some complexity itself.

#2 is more interesting but I'm not sure how to use this extra
flexibility to get a big advantage. Maybe vfio-user applications could
install an eBPF program that speaks the vfio-user protocol instead of
the ioregionfd protocol, making it easier to integrate ioregionfd into
vfio-user programs?

My opinion is that eBPF complicates things and since we lack a strong
use case for that extra flexibility, I would stick to the ioregionfd
proposal.

Elena, Jason: Do you have any opinions on this?

Stefan

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-01 10:35               ` Stefan Hajnoczi
@ 2020-12-02  2:53                 ` Jason Wang
  2020-12-02 14:17                 ` Elena Afanasova
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Wang @ 2020-12-02  2:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	felipe, Elena Ufimtseva, Jag Raman


On 2020/12/1 6:35 PM, Stefan Hajnoczi wrote:
> On Tue, Dec 01, 2020 at 12:05:04PM +0800, Jason Wang wrote:
>> On 2020/11/30 下午8:47, Stefan Hajnoczi wrote:
>>> On Mon, Nov 30, 2020 at 10:14:15AM +0800, Jason Wang wrote:
>>>> On 2020/11/27 下午9:44, Stefan Hajnoczi wrote:
>>>>> On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
>>>>>> On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
>>>>>>>> On 2020/11/26 上午3:21, Elena Afanasova wrote:
>>>>>> Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
>>>>>> allow it to decide how to proceed?
>>>>> The eBPF program approach is interesting, but it would probably require
>>>>> access to guest RAM and additional userspace state (e.g. device-specific
>>>>> register values). I don't know the current status of Linux eBPF - is it
>>>>> possible to access user memory (it could be swapped out)?
>>>> AFAIK it doesn't, but just to make sure I understand, any reason that eBPF
>>>> need to access userspace memory here?
>>> Maybe we're thinking of different things. In the past I've thought about
>>> using eBPF to avoid a trip to userspace for request submission and
>>> completion, but that requires virtqueue parsing from eBPF and guest RAM
>>> access.
>>
>> I see. I've  considered something similar. e.g using eBPF dataplane in
>> vhost, but it requires a lot of work. For guest RAM access, we probably can
>> provide some eBPF helpers to do that but we need strong point to convince
>> eBPF guys.
>>
>>
>>> Are you thinking about replacing ioctl(KVM_SET_IOREGION) and all the
>>> necessary kvm.ko code with an ioctl(KVM_SET_IO_PROGRAM), where userspace
>>> can load an eBPF program into kvm.ko that gets executed when an MMIO/PIO
>>> accesses occur?
>>
>> Yes.
>>
>>
>>>    Wouldn't it need to write to userspace memory to store
>>> the ring index that was written to the doorbell register, for example?
>>
>> The proram itself can choose want to do:
>>
>> 1) do datamatch and write/wakeup eventfd
>>
>> or
>>
>> 2) transport the write via an arbitrary fd as what has been done in this
>> proposal, but the protocol is userspace defined
>>
>>> How would the program communicate with userspace (eventfd isn't enough)
>>> and how can it handle synchronous I/O accesses like reads?
>>
>> I may miss something, but it can behave exactly as what has been proposed in
>> this patch?
> I see. This seems to have two possible advantages:
> 1. Pushing the kvm.ko code into userspace thanks to eBPF. Less kernel
>     code.
> 2. Allowing more flexibile I/O dispatch logic (e.g. ioeventfd-style
>     datamatch) and communication protocols.
>
> I think #1 is minor because the communication protocol is trivial,
> struct kvm_io_device can be reused for dispatch, and eBPF will introduce
> some complexity itself.
>
> #2 is more interesting but I'm not sure how to use this extra
> flexibility to get a big advantage. Maybe vfio-user applications could
> install an eBPF program that speaks the vfio-user protocol instead of
> the ioregionfd protocol, making it easier to integrate ioregionfd into
> vfio-user programs?


Yes, that could be one. Basically it shifts the policy from the kernel to 
userspace.


>
> My opinion is that eBPF complicates things and since we lack a strong
> use case for that extra flexibility, I would stick to the ioregionfd
> proposal.
>
> Elena, Jason: Do you have any opinions on this?


I agree. And we need a way to make it work without eBPF. Let's leave it 
for future investigation.

Thanks


>
> Stefan


* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-01 10:35               ` Stefan Hajnoczi
  2020-12-02  2:53                 ` Jason Wang
@ 2020-12-02 14:17                 ` Elena Afanasova
  1 sibling, 0 replies; 33+ messages in thread
From: Elena Afanasova @ 2020-12-02 14:17 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jason Wang
  Cc: kvm, mst, john.g.johnson, dinechin, cohuck, felipe,
	Elena Ufimtseva, Jag Raman

On Tue, 2020-12-01 at 10:35 +0000, Stefan Hajnoczi wrote:
> On Tue, Dec 01, 2020 at 12:05:04PM +0800, Jason Wang wrote:
> > On 2020/11/30 下午8:47, Stefan Hajnoczi wrote:
> > > On Mon, Nov 30, 2020 at 10:14:15AM +0800, Jason Wang wrote:
> > > > On 2020/11/27 下午9:44, Stefan Hajnoczi wrote:
> > > > > On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
> > > > > > On 2020/11/26 下午8:36, Stefan Hajnoczi wrote:
> > > > > > > On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang
> > > > > > > wrote:
> > > > > > > > On 2020/11/26 上午3:21, Elena Afanasova wrote:
> > > > > > Or I wonder whether we can attach an eBPF program when
> > > > > > trapping MMIO/PIO and
> > > > > > allow it to decide how to proceed?
> > > > > The eBPF program approach is interesting, but it would
> > > > > probably require
> > > > > access to guest RAM and additional userspace state (e.g.
> > > > > device-specific
> > > > > register values). I don't know the current status of Linux
> > > > > eBPF - is it
> > > > > possible to access user memory (it could be swapped out)?
> > > > 
> > > > AFAIK it doesn't, but just to make sure I understand, any
> > > > reason that eBPF
> > > > need to access userspace memory here?
> > > Maybe we're thinking of different things. In the past I've
> > > thought about
> > > using eBPF to avoid a trip to userspace for request submission
> > > and
> > > completion, but that requires virtqueue parsing from eBPF and
> > > guest RAM
> > > access.
> > 
> > I see. I've  considered something similar. e.g using eBPF dataplane
> > in
> > vhost, but it requires a lot of work. For guest RAM access, we
> > probably can
> > provide some eBPF helpers to do that but we need strong point to
> > convince
> > eBPF guys.
> > 
> > 
> > > Are you thinking about replacing ioctl(KVM_SET_IOREGION) and all
> > > the
> > > necessary kvm.ko code with an ioctl(KVM_SET_IO_PROGRAM), where
> > > userspace
> > > can load an eBPF program into kvm.ko that gets executed when an
> > > MMIO/PIO
> > > accesses occur?
> > 
> > Yes.
> > 
> > 
> > >   Wouldn't it need to write to userspace memory to store
> > > the ring index that was written to the doorbell register, for
> > > example?
> > 
> > The proram itself can choose want to do:
> > 
> > 1) do datamatch and write/wakeup eventfd
> > 
> > or
> > 
> > 2) transport the write via an arbitrary fd as what has been done in
> > this
> > proposal, but the protocol is userspace defined
> > 
> > > How would the program communicate with userspace (eventfd isn't
> > > enough)
> > > and how can it handle synchronous I/O accesses like reads?
> > 
> > I may miss something, but it can behave exactly as what has been
> > proposed in
> > this patch?
> 
> I see. This seems to have two possible advantages:
> 1. Pushing the kvm.ko code into userspace thanks to eBPF. Less kernel
>    code.
> 2. Allowing more flexibile I/O dispatch logic (e.g. ioeventfd-style
>    datamatch) and communication protocols.
> 
> I think #1 is minor because the communication protocol is trivial,
> struct kvm_io_device can be reused for dispatch, and eBPF will
> introduce
> some complexity itself.
> 
> #2 is more interesting but I'm not sure how to use this extra
> flexibility to get a big advantage. Maybe vfio-user applications
> could
> install an eBPF program that speaks the vfio-user protocol instead of
> the ioregionfd protocol, making it easier to integrate ioregionfd
> into
> vfio-user programs?
> 
> My opinion is that eBPF complicates things and since we lack a strong
> use case for that extra flexibility, I would stick to the ioregionfd
> proposal.
> 
I agree with this point.

> Elena, Jason: Do you have any opinions on this?
> 
> Stefan


* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-27 12:22             ` John Levon
@ 2021-10-28  8:14               ` Stefan Hajnoczi
  -1 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2021-10-28  8:14 UTC (permalink / raw)
  To: John Levon
  Cc: Elena, qemu-devel, kvm, mst, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, jag.raman, eafanasova

On Wed, Oct 27, 2021 at 01:22:28PM +0100, John Levon wrote:
> On Wed, Oct 27, 2021 at 11:15:35AM +0100, Stefan Hajnoczi wrote:
> 
> > > > I like this approach as well.
> > > > As you have mentioned, the device emulation code with first approach
> > > > does have to how to handle the region accesses. The second approach will
> > > > make things more transparent. Let me see how can I modify what there is
> > > > there now and may ask further questions.
> > > 
> > > Sorry I'm a bit late to this discussion, I'm not clear on the above WRT
> > > vfio-user. If an ioregionfd has to cover a whole BAR0 (?), how would this
> > > interact with partly-mmap()able regions like we do with SPDK/vfio-user/NVMe?
> > 
> > The ioregionfd doesn't need to cover an entire BAR. QEMU's MemoryRegions
> > form a hierarchy, so it's possible to sub-divide the BAR into several
> > MemoryRegions.
> 
> I think you're saying that when vfio-user client in qemu calls
> VFIO_USER_DEVICE_GET_REGION_IO_FDS, it would create a sub-MR corresponding to
> each one, before asking KVM to configure them?

Yes. Actually I wasn't thinking of the vfio-user client but just about
QEMU device emulation code in general. What you suggested sounds like a
clean mapping from MemoryRegions to vfio-user.

Stefan

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-27 10:15           ` Stefan Hajnoczi
@ 2021-10-27 12:22             ` John Levon
  -1 siblings, 0 replies; 33+ messages in thread
From: John Levon @ 2021-10-27 12:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena, qemu-devel, kvm, mst, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, jag.raman, eafanasova

On Wed, Oct 27, 2021 at 11:15:35AM +0100, Stefan Hajnoczi wrote:

> > > I like this approach as well.
> > > As you have mentioned, the device emulation code with first approach
> > > does have to know how to handle the region accesses. The second approach will
> > > make things more transparent. Let me see how can I modify what there is
> > > there now and may ask further questions.
> > 
> > Sorry I'm a bit late to this discussion, I'm not clear on the above WRT
> > vfio-user. If an ioregionfd has to cover a whole BAR0 (?), how would this
> > interact with partly-mmap()able regions like we do with SPDK/vfio-user/NVMe?
> 
> The ioregionfd doesn't need to cover an entire BAR. QEMU's MemoryRegions
> form a hierarchy, so it's possible to sub-divide the BAR into several
> MemoryRegions.

I think you're saying that when the vfio-user client in qemu calls
VFIO_USER_DEVICE_GET_REGION_IO_FDS, it would create a sub-MR corresponding to
each one, before asking KVM to configure them?

thanks
john

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-26 19:01         ` John Levon
@ 2021-10-27 10:15           ` Stefan Hajnoczi
  -1 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2021-10-27 10:15 UTC (permalink / raw)
  To: John Levon
  Cc: Elena, qemu-devel, kvm, mst, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, jag.raman, eafanasova

On Tue, Oct 26, 2021 at 08:01:39PM +0100, John Levon wrote:
> On Mon, Oct 25, 2021 at 08:21:22AM -0700, Elena wrote:
> 
> > > I'm curious what approach you want to propose for QEMU integration. A
> > > while back I thought about the QEMU API. It's possible to implement it
> > > along the lines of the memory_region_add_eventfd() API where each
> > > ioregionfd is explicitly added by device emulation code. An advantage of
> > > this approach is that a MemoryRegion can have multiple ioregionfds, but
> > > I'm not sure if that is a useful feature.
> > >
> > 
> > This is the approach that is currently in the works. Agree, I dont see
> > much of the application here at this point to have multiple ioregions
> > per MemoryRegion.
> > I added Memory API/eventfd approach to the vfio-user as well to try
> > things out.
> > 
> > > An alternative is to cover the entire MemoryRegion with one ioregionfd.
> > > That way the device emulation code can use ioregionfd without much fuss
> > > since there is a 1:1 mapping between MemoryRegions, which are already
> > > there in existing devices. There is no need to think deeply about which
> > > ioregionfds to create for a device.
> > >
> > > A new API called memory_region_set_aio_context(MemoryRegion *mr,
> > > AioContext *ctx) would cause ioregionfd (or a userspace fallback for
> > > non-KVM cases) to execute the MemoryRegion->read/write() accessors from
> > > the given AioContext. The details of ioregionfd are hidden behind the
> > > memory_region_set_aio_context() API, so the device emulation code
> > > doesn't need to know the capabilities of ioregionfd.
> > 
> > > 
> > > The second approach seems promising if we want more devices to use
> > > ioregionfd inside QEMU because it requires less ioregionfd-specific
> > > code.
> > > 
> > I like this approach as well.
> > As you have mentioned, the device emulation code with first approach
> > does have to know how to handle the region accesses. The second approach will
> > make things more transparent. Let me see how can I modify what there is
> > there now and may ask further questions.
> 
> Sorry I'm a bit late to this discussion, I'm not clear on the above WRT
> vfio-user. If an ioregionfd has to cover a whole BAR0 (?), how would this
> interact with partly-mmap()able regions like we do with SPDK/vfio-user/NVMe?

The ioregionfd doesn't need to cover an entire BAR. QEMU's MemoryRegions
form a hierarchy, so it's possible to sub-divide the BAR into several
MemoryRegions.

This means it's still possible to have mmap() sub-regions or even
ioeventfds sprinkled in between.
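
As a rough illustration of that sub-division in QEMU device code (the
memory_region_* calls are the existing QEMU API, but the device names,
sizes, and ops below are made up):

    #include "qemu/osdep.h"
    #include "qemu/units.h"
    #include "exec/memory.h"

    /* Carve a small dispatch-handled register window out of a larger BAR.
     * The container BAR can still hold mmap()able RAM sub-regions and
     * ioeventfds elsewhere; only "regs" would be backed by ioregionfd (or
     * by ordinary MMIO emulation when ioregionfd is unavailable). */
    static void mydev_init_bar0(Object *owner, MemoryRegion *bar0,
                                MemoryRegion *regs,
                                const MemoryRegionOps *reg_ops, void *opaque)
    {
        memory_region_init(bar0, owner, "mydev-bar0", 16 * MiB);

        memory_region_init_io(regs, owner, reg_ops, opaque,
                              "mydev-bar0-regs", 4 * KiB);
        memory_region_add_subregion(bar0, 0, regs);
    }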

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-25 15:21       ` Elena
@ 2021-10-26 19:01         ` John Levon
  -1 siblings, 0 replies; 33+ messages in thread
From: John Levon @ 2021-10-26 19:01 UTC (permalink / raw)
  To: Elena
  Cc: john.g.johnson, jag.raman, kvm, mst, jasowang, cohuck,
	qemu-devel, eafanasova, Stefan Hajnoczi, felipe, dinechin

On Mon, Oct 25, 2021 at 08:21:22AM -0700, Elena wrote:

> > I'm curious what approach you want to propose for QEMU integration. A
> > while back I thought about the QEMU API. It's possible to implement it
> > along the lines of the memory_region_add_eventfd() API where each
> > ioregionfd is explicitly added by device emulation code. An advantage of
> > this approach is that a MemoryRegion can have multiple ioregionfds, but
> > I'm not sure if that is a useful feature.
> >
> 
> This is the approach that is currently in the works. Agreed, I don't see
> much of a use case at this point for multiple ioregions per MemoryRegion.
> I added the Memory API/eventfd approach to vfio-user as well to try
> things out.
> 
> > An alternative is to cover the entire MemoryRegion with one ioregionfd.
> > That way the device emulation code can use ioregionfd without much fuss
> > since there is a 1:1 mapping between MemoryRegions, which are already
> > there in existing devices. There is no need to think deeply about which
> > ioregionfds to create for a device.
> >
> > A new API called memory_region_set_aio_context(MemoryRegion *mr,
> > AioContext *ctx) would cause ioregionfd (or a userspace fallback for
> > non-KVM cases) to execute the MemoryRegion->read/write() accessors from
> > the given AioContext. The details of ioregionfd are hidden behind the
> > memory_region_set_aio_context() API, so the device emulation code
> > doesn't need to know the capabilities of ioregionfd.
> 
> > 
> > The second approach seems promising if we want more devices to use
> > ioregionfd inside QEMU because it requires less ioregionfd-specific
> > code.
> > 
> I like this approach as well.
> As you have mentioned, with the first approach the device emulation code
> does have to know how to handle the region accesses. The second approach will
> make things more transparent. Let me see how I can modify what is there
> now, and I may ask further questions.

Sorry I'm a bit late to this discussion, I'm not clear on the above WRT
vfio-user. If an ioregionfd has to cover a whole BAR0 (?), how would this
interact with partly-mmap()able regions like we do with SPDK/vfio-user/NVMe?

thanks
john


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-25 15:21       ` Elena
@ 2021-10-25 16:56         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2021-10-25 16:56 UTC (permalink / raw)
  To: Elena
  Cc: qemu-devel, kvm, mst, john.g.johnson, dinechin, cohuck, jasowang,
	felipe, jag.raman, eafanasova

[-- Attachment #1: Type: text/plain, Size: 2714 bytes --]

On Mon, Oct 25, 2021 at 08:21:22AM -0700, Elena wrote:
> On Mon, Oct 25, 2021 at 01:42:56PM +0100, Stefan Hajnoczi wrote:
> > On Mon, Oct 11, 2021 at 10:34:29PM -0700, elena wrote:
> > > On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > > > Hello,
> > > >
> > > 
> > > Hi
> > > 
> > > Sorry for top-posting, just wanted to provide a quick update.
> > > We are currently working on the support for ioregionfd in QEMU and will
> > > be posting the patches soon. Plus the KVM patches will be posted based
> > > on the RFC v3 with some fixes, if there are no objections from Elena's side,
> > > who originally posted the KVM RFC patchset.
> > 
> > Hi Elena,
> 
> Hello Stefan.
> 
> > I'm curious what approach you want to propose for QEMU integration. A
> > while back I thought about the QEMU API. It's possible to implement it
> > along the lines of the memory_region_add_eventfd() API where each
> > ioregionfd is explicitly added by device emulation code. An advantage of
> > this approach is that a MemoryRegion can have multiple ioregionfds, but
> > I'm not sure if that is a useful feature.
> >
> 
> This is the approach that is currently in the works. Agreed, I don't see
> much of a use case at this point for multiple ioregions per MemoryRegion.
> I added the Memory API/eventfd approach to vfio-user as well to try
> things out.
> 
> > An alternative is to cover the entire MemoryRegion with one ioregionfd.
> > That way the device emulation code can use ioregionfd without much fuss
> > since there is a 1:1 mapping between MemoryRegions, which are already
> > there in existing devices. There is no need to think deeply about which
> > ioregionfds to create for a device.
> >
> > A new API called memory_region_set_aio_context(MemoryRegion *mr,
> > AioContext *ctx) would cause ioregionfd (or a userspace fallback for
> > non-KVM cases) to execute the MemoryRegion->read/write() accessors from
> > the given AioContext. The details of ioregionfd are hidden behind the
> > memory_region_set_aio_context() API, so the device emulation code
> > doesn't need to know the capabilities of ioregionfd.
> 
> > 
> > The second approach seems promising if we want more devices to use
> > ioregionfd inside QEMU because it requires less ioregionfd-specific
> > code.
> > 
> I like this approach as well.
> As you have mentioned, with the first approach the device emulation code
> does have to know how to handle the region accesses. The second approach will
> make things more transparent. Let me see how I can modify what is there
> now, and I may ask further questions.

Thanks, I look forward to patches you are working on!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-25 12:42     ` Stefan Hajnoczi
@ 2021-10-25 15:21       ` Elena
  -1 siblings, 0 replies; 33+ messages in thread
From: Elena @ 2021-10-25 15:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, kvm, mst, john.g.johnson, dinechin, cohuck, jasowang,
	felipe, jag.raman, eafanasova

On Mon, Oct 25, 2021 at 01:42:56PM +0100, Stefan Hajnoczi wrote:
> On Mon, Oct 11, 2021 at 10:34:29PM -0700, elena wrote:
> > On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > > Hello,
> > >
> > 
> > Hi
> > 
> > Sorry for top-posting, just wanted to provide a quick update.
> > We are currently working on the support for ioregionfd in QEMU and will
> > be posting the patches soon. Plus the KVM patches will be posted based
> > on the RFC v3 with some fixes, if there are no objections from Elena's side,
> > who originally posted the KVM RFC patchset.
> 
> Hi Elena,

Hello Stefan.

> I'm curious what approach you want to propose for QEMU integration. A
> while back I thought about the QEMU API. It's possible to implement it
> along the lines of the memory_region_add_eventfd() API where each
> ioregionfd is explicitly added by device emulation code. An advantage of
> this approach is that a MemoryRegion can have multiple ioregionfds, but
> I'm not sure if that is a useful feature.
>

This is the approach that is currently in the works. Agreed, I don't see
much of a use case at this point for multiple ioregions per MemoryRegion.
I added the Memory API/eventfd approach to vfio-user as well to try
things out.

> An alternative is to cover the entire MemoryRegion with one ioregionfd.
> That way the device emulation code can use ioregionfd without much fuss
> since there is a 1:1 mapping between MemoryRegions, which are already
> there in existing devices. There is no need to think deeply about which
> ioregionfds to create for a device.
>
> A new API called memory_region_set_aio_context(MemoryRegion *mr,
> AioContext *ctx) would cause ioregionfd (or a userspace fallback for
> non-KVM cases) to execute the MemoryRegion->read/write() accessors from
> the given AioContext. The details of ioregionfd are hidden behind the
> memory_region_set_aio_context() API, so the device emulation code
> doesn't need to know the capabilities of ioregionfd.

> 
> The second approach seems promising if we want more devices to use
> ioregionfd inside QEMU because it requires less ioregionfd-specific
> code.
> 
I like this approach as well.
As you have mentioned, with the first approach the device emulation code
does have to know how to handle the region accesses. The second approach will
make things more transparent. Let me see how I can modify what is there
now, and I may ask further questions.

Thank you for your input Stefan.
Elena

> Stefan



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2021-10-12  5:34   ` elena
@ 2021-10-25 12:42     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2021-10-25 12:42 UTC (permalink / raw)
  To: elena
  Cc: qemu-devel, kvm, mst, john.g.johnson, dinechin, cohuck, jasowang,
	felipe, jag.raman, eafanasova

[-- Attachment #1: Type: text/plain, Size: 1790 bytes --]

On Mon, Oct 11, 2021 at 10:34:29PM -0700, elena wrote:
> On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > Hello,
> >
> 
> Hi
> 
> Sorry for top-posting, just wanted to provide a quick update.
> We are currently working on the support for ioregionfd in QEMU and will
> be posting the patches soon. Plus the KVM patches will be posted based
> on the RFC v3 with some fixes, if there are no objections from Elena's side,
> who originally posted the KVM RFC patchset.

Hi Elena,
I'm curious what approach you want to propose for QEMU integration. A
while back I thought about the QEMU API. It's possible to implement it
along the lines of the memory_region_add_eventfd() API where each
ioregionfd is explicitly added by device emulation code. An advantage of
this approach is that a MemoryRegion can have multiple ioregionfds, but
I'm not sure if that is a useful feature.

An alternative is to cover the entire MemoryRegion with one ioregionfd.
That way the device emulation code can use ioregionfd without much fuss
since there is a 1:1 mapping between MemoryRegions, which are already
there in existing devices. There is no need to think deeply about which
ioregionfds to create for a device.

A new API called memory_region_set_aio_context(MemoryRegion *mr,
AioContext *ctx) would cause ioregionfd (or a userspace fallback for
non-KVM cases) to execute the MemoryRegion->read/write() accessors from
the given AioContext. The details of ioregionfd are hidden behind the
memory_region_set_aio_context() API, so the device emulation code
doesn't need to know the capabilities of ioregionfd.
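
If such an API existed, a device might only need something along these
lines in its realize function (memory_region_set_aio_context() is the
proposed call; the device, ops, and iothread fields are invented for
illustration):

  static void mydev_realize(PCIDevice *pdev, Error **errp)
  {
      MyDevState *s = MYDEV(pdev);

      memory_region_init_io(&s->mmio, OBJECT(s), &mydev_mmio_ops, s,
                            "mydev-mmio", 0x1000);

      /* Run mydev_mmio_ops.read/write in the device's IOThread; whether
       * that is backed by ioregionfd or a userspace fallback is hidden. */
      memory_region_set_aio_context(&s->mmio,
                                    iothread_get_aio_context(s->iothread));

      pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
  }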

The second approach seems promising if we want more devices to use
ioregionfd inside QEMU because it requires less ioregionfd-specific
code.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-25 20:44 Elena Afanasova
@ 2021-10-12  5:34   ` elena
  2021-10-12  5:34   ` elena
  1 sibling, 0 replies; 33+ messages in thread
From: elena @ 2021-10-12  5:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, mst, john.g.johnson, dinechin, cohuck, jasowang, felipe,
	stefanha, jag.raman, eafanasova

On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> Hello,
>

Hi

Sorry for top-posting, just wanted to provide a quick update.
We are currently working on the support for ioregionfd in QEMU and will
be posting the patches soon. Plus the KVM patches will be posted based
on the RFC v3 with some fixes, if there are no objections from Elena's side,
who originally posted the KVM RFC patchset.


Thanks!
Elena Ufimtseva

> I'm an Outreachy intern with QEMU and I’m working on implementing the ioregionfd 
> API in KVM. So I’d like to resume the ioregionfd design discussion. The latest 
> version of the ioregionfd API document is provided below.
> 
> Overview
> --------
> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses over a
> file descriptor without returning from ioctl(KVM_RUN). This allows device
> emulation to run in another task separate from the vCPU task.
> 
> This is achieved through KVM ioctls for registering MMIO/PIO regions and a wire
> protocol that KVM uses to communicate with a task handling an MMIO/PIO access.
> 
> The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation in a
> separate task looks like this:
> 
>    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task
> 
> ioregionfd improves performance by eliminating the need for the vCPU task to
> forward MMIO/PIO exits to device emulation tasks:
> 
>    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task
>      ^
>      `---ioregionfd---> device task
> 
> Both multi-threaded and multi-process VMMs can take advantage of ioregionfd to
> run device emulation in dedicated threads and processes, respectively.
> 
> This mechanism is similar to ioeventfd except it supports all read and write
> accesses, whereas ioeventfd only supports posted doorbell writes.
> 
> Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work alongside
> the new mechanism, but only one mechanism handles a MMIO/PIO access.
> 
> KVM_CREATE_IOREGIONFD
> ---------------------
> :Capability: KVM_CAP_IOREGIONFD
> :Architectures: all
> :Type: system ioctl
> :Parameters: none
> :Returns: an ioregionfd file descriptor, -1 on error
> 
> This ioctl creates a new ioregionfd and returns the file descriptor. The fd can
> be used to handle MMIO/PIO accesses instead of returning from ioctl(KVM_RUN)
> with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must be
> registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses on the
> fd. An ioregionfd can be used with multiple VMs and its lifecycle is not tied
> to a specific VM.
> 
> When the last file descriptor for an ioregionfd is closed, all regions
> registered with KVM_SET_IOREGION are dropped and guest accesses to those
> regions cause ioctl(KVM_RUN) to return again.
> 
> KVM_SET_IOREGION
> ----------------
> :Capability: KVM_CAP_IOREGIONFD
> :Architectures: all
> :Type: vm ioctl
> :Parameters: struct kvm_ioregion (in)
> :Returns: 0 on success, -1 on error
> 
> This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO region. Guest
> read and write accesses are dispatched through the given ioregionfd instead of
> returning from ioctl(KVM_RUN).
> 
> ::
> 
>   struct kvm_ioregion {
>       __u64 guest_paddr; /* guest physical address */
>       __u64 memory_size; /* bytes */
>       __u64 user_data;
>       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
>       __u32 flags;
>       __u8  pad[32];
>   };
> 
>   /* for kvm_ioregion::flags */
>   #define KVM_IOREGION_PIO           (1u << 0)
>   #define KVM_IOREGION_POSTED_WRITES (1u << 1)
> 
> If a new region would split an existing region -1 is returned and errno is
> EINVAL.
> 
> Regions can be deleted by setting fd to -1. If no existing region matches
> guest_paddr and memory_size then -1 is returned and errno is ENOENT.
> 
> Existing regions can be modified as long as guest_paddr and memory_size
> match an existing region.
> 
> MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead.
> 
> The user_data value is included in messages KVM writes to the ioregionfd upon
> guest access. KVM does not interpret user_data.
> 
> Both read and write guest accesses wait for a response before entering the
> guest again. The KVM_IOREGION_POSTED_WRITES flag does not wait for a response
> and immediately enters the guest again. This is suitable for accesses that do
> not require synchronous emulation, such as posted doorbell register writes.
> Note that guest writes may block the vCPU despite KVM_IOREGION_POSTED_WRITES if
> the device is too slow in reading from the ioregionfd.
> 
> Wire protocol
> -------------
> The protocol spoken over the file descriptor is as follows. The device reads
> commands from the file descriptor with the following layout::
> 
>   struct ioregionfd_cmd {
>       __u32 info;
>       __u32 padding;
>       __u64 user_data;
>       __u64 offset;
>       __u64 data;
>   };
> 
> The info field layout is as follows::
> 
>   bits:  | 31 ... 8 |  6   | 5 ... 4 | 3 ... 0 |
>   field: | reserved | resp |   size  |   cmd   |
> 
> The cmd field identifies the operation to perform::
> 
>   #define IOREGIONFD_CMD_READ  0
>   #define IOREGIONFD_CMD_WRITE 1
> 
> The size field indicates the size of the access::
> 
>   #define IOREGIONFD_SIZE_8BIT  0
>   #define IOREGIONFD_SIZE_16BIT 1
>   #define IOREGIONFD_SIZE_32BIT 2
>   #define IOREGIONFD_SIZE_64BIT 3
> 
> If the command is IOREGIONFD_CMD_WRITE then the resp bit indicates whether or
> not a response must be sent.
> 
> The user_data field contains the opaque value provided to KVM_SET_IOREGION.
> Applications can use this to uniquely identify the region that is being
> accessed.
> 
> The offset field contains the byte offset being accessed within a region
> that was registered with KVM_SET_IOREGION.
> 
> If the command is IOREGIONFD_CMD_WRITE then data contains the value
> being written. The data value is a 64-bit integer in host endianness,
> regardless of the access size.
> 
> The device sends responses by writing the following structure to the
> file descriptor::
> 
>   struct ioregionfd_resp {
>       __u64 data;
>       __u8 pad[24];
>   };
> 
> The data field contains the value read by an IOREGIONFD_CMD_READ
> command. This field is zero for other commands. The data value is a 64-bit
> integer in host endianness, regardless of the access size.
> 
> Ordering
> --------
> Guest accesses are delivered in order, including posted writes.
> 
> Signals
> -------
> The vCPU task can be interrupted by a signal while waiting for an ioregionfd
> response. In this case ioctl(KVM_RUN) returns with -EINTR. Guest entry is
> deferred until ioctl(KVM_RUN) is called again and the response has been written
> to the ioregionfd.
> 
> Security
> --------
> Device emulation processes may be untrusted in multi-process VMM architectures.
> Therefore the control plane and the data plane of ioregionfd are separate. A
> task that only has access to an ioregionfd is unable to add/modify/remove
> regions since that requires ioctls on a KVM vm fd. This ensures that device
> emulation processes can only service MMIO/PIO accesses for regions that the VMM
> registered on their behalf.
> 
> Multi-queue scalability
> -----------------------
> The protocol is synchronous - only one command/response cycle is in flight at a
> time - but the vCPU will be blocked until the response has been processed
> anyway. If another vCPU accesses an MMIO or PIO region belonging to the same
> ioregionfd during this time then it waits for the first access to complete.
> 
> Per-queue ioregionfds can be set up to take advantage of concurrency on
> multi-queue devices.
> 
> Polling
> -------
> Userspace can poll ioregionfd by submitting an io_uring IORING_OP_READ request
> and polling the cq ring to detect when the read has completed. Although this
> dispatch mechanism incurs more overhead than polling directly on guest RAM, it
> captures each write access and supports reads.
> 
> Does it obsolete ioeventfd?
> ---------------------------
> No, although KVM_IOREGION_POSTED_WRITES offers somewhat similar functionality
> to ioeventfd, there are differences. The datamatch functionality of ioeventfd
> is not available and would need to be implemented by the device emulation
> program. Due to the counter semantics of eventfds there is automatic coalescing
> of repeated accesses with ioeventfd. Overall ioeventfd is lighter weight but
> also more limited.
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-03 14:40     ` Peter Xu
@ 2020-12-07 14:58       ` Stefan Hajnoczi
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-12-07 14:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Stefan Hajnoczi, Elena Afanasova, kvm, mst, john.g.johnson,
	dinechin, cohuck, jasowang, felipe, elena.ufimtseva, jag.raman

[-- Attachment #1: Type: text/plain, Size: 5161 bytes --]

On Thu, Dec 03, 2020 at 09:40:37AM -0500, Peter Xu wrote:
> On Thu, Dec 03, 2020 at 11:10:36AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Dec 02, 2020 at 01:06:28PM -0500, Peter Xu wrote:
> > > On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > > 
> > > [...]
> > > 
> > > > Wire protocol
> > > > -------------
> > > > The protocol spoken over the file descriptor is as follows. The device reads
> > > > commands from the file descriptor with the following layout::
> > > > 
> > > >   struct ioregionfd_cmd {
> > > >       __u32 info;
> > > >       __u32 padding;
> > > >       __u64 user_data;
> > > >       __u64 offset;
> > > >       __u64 data;
> > > >   };
> > > 
> > > I'm thinking whether it would be nice to have a handshake on the wire protocol
> > > before starting the cmd/resp sequence.
> > > 
> > > I was thinking about migration - we have had a hard time trying to be
> > > compatible between old/new qemus.  Now we fixed those by applying the same
> > > migration capabilities on both sides always so we do the handshake "manually"
> > > from libvirt, but it really should be done with a real handshake on the
> > > channel, imho..  That's another story, for sure.
> > > 
> > > My understanding is that the wire protocol is kind of a standalone (but tiny)
> > > protocol between kvm and the emulation process.  So I'm thinking the handshake
> > > could also help when e.g. kvm can fallback to an old version of wire protocol
> > > if it knows the emulation binary is old.  Ideally, I think this could even
> > > happen without VMM's awareness.
> > > 
> > > [...]
> > 
> > I imagined that would happen in the control plane (KVM ioctls) instead
> > of the data plane (the fd). There is a flags field in
> > ioctl(KVM_SET_IOREGION):
> > 
> >   struct kvm_ioregion {
> >       __u64 guest_paddr; /* guest physical address */
> >       __u64 memory_size; /* bytes */
> >       __u64 user_data;
> >       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
> >       __u32 flags;
> >       __u8  pad[32];
> >   };
> > 
> > When userspace sets up the ioregionfd it can tell the kernel which
> > features to enable.
> > 
> > Feature availability can be checked through ioctl(KVM_CHECK_EXTENSION).
> > 
> > Do you think this existing mechanism is enough? It's not clear to me
> > what kind of additional negotiation would be necessary between the
> > device emulation process and KVM after the ioregionfd has been
> > registered?
> 
> Yes I think kvm capability can be used as a way for the handshake between VMM
> and KVM.  However I'm not sure how to do similar things to the emulation
> process besides passing over the ioregionfd, because that's between VMM and the
> emulation process.  Is there a way to do so with the current proposal?

The interface is designed so the VMM controls the ioregionfd setup on
behalf of the device emulation process. This is for security: the device
emulation process is untrusted.

There is setup between the VMM and the device emulation process that is
outside the scope of the KVM API, where the device emulation process
could ask the VMM to enable ioregionfd features. For example, if QEMU is
initializing a vfio-user device emulation program, then QEMU queries the
available vfio-user device regions from that program. This is where extra
information like ioregionfd support would be indicated by the device
emulation program.
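
On the VMM side the flow would then look roughly like this (using the
ioctls from the proposal, which are not upstream; the region values and
the fd-passing helper are placeholders):

  int ioregionfd = ioctl(kvm_fd, KVM_CREATE_IOREGIONFD, 0);

  struct kvm_ioregion region = {
      .guest_paddr = region_gpa,          /* MMIO range to trap */
      .memory_size = region_size,
      .user_data   = region_id,           /* echoed back in each command */
      .fd          = ioregionfd,
      .flags       = 0,                   /* or KVM_IOREGION_POSTED_WRITES */
  };
  ioctl(vm_fd, KVM_SET_IOREGION, &region);

  /* Only the data-plane fd is handed to the untrusted device process
   * (e.g. over a UNIX socket with SCM_RIGHTS). It cannot add, modify, or
   * remove regions because that requires the vm fd. */
  send_fd_to_device_process(sock_fd, ioregionfd);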

> The out-of-order feature may not be a good enough example if it's destined to
> be useless... but let's just imagine we have such a requirement as an extension
> to the current wire protocol.  What I was thinking was whether we could have
> another handshake somehow to the emulation process, so that we can identify "ok
> this emulation process supports out-of-band" or vice versa.  Only with that
> information could VMM enable/disable the out-of-band in KVM.
> 
> The handshake to the emulation process can start from either VMM or KVM.  And
> my previous idea was simply to let KVM and the emulation process talk rather than VMM
> against the emulation, because they really look like two isolated protocols:
> 
>   - The VMM <-> KVM talks via KVM_SET_IOREGIONFD, it is only responsible to
>     setup a mmio/pio region in the guest and create the fd. As long as the
>     region is properly setup, and the fd is passed over to emulation process,
>     it's done as a proxy layer.
> 
>   - The KVM <-> emulation process talks via the wire protocol (so this protocol
>     can be irrelevant to KVM_SET_IOREGIONFD protocol itself).  For example the
>     out-of-band feature changes the behavior of the wire protocol.  Ideally it
>     does not even need a new KVM capability because KVM_SET_IOREGIONFD
>     functionality shouldn't be touched.
> 
> I thought it would be cleaner to have these protocols separate, but I could
> also be wrong or over-engineering.

This makes sense to me if the wire protocol needs to become complex. For
now sticking to the simple control/data plane approach is what I would
suggest for security.
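
For reference, the data plane the device emulation process sees with the
current proposal is just a blocking read/decode/write loop, roughly (the
device_read()/device_write() helpers are placeholders):

  struct ioregionfd_cmd cmd;
  struct ioregionfd_resp resp;

  while (read(ioregionfd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
      unsigned op    = cmd.info & 0xf;         /* cmd field, bits 3..0  */
      unsigned size  = (cmd.info >> 4) & 0x3;  /* size field, bits 5..4 */
      bool need_resp = cmd.info & (1u << 6);   /* resp bit              */

      memset(&resp, 0, sizeof(resp));
      if (op == IOREGIONFD_CMD_READ) {
          resp.data = device_read(cmd.user_data, cmd.offset, size);
          need_resp = true;                    /* reads always respond  */
      } else {
          device_write(cmd.user_data, cmd.offset, size, cmd.data);
      }
      if (need_resp) {
          write(ioregionfd, &resp, sizeof(resp));
      }
  }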

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-03 11:34     ` Michael S. Tsirkin
@ 2020-12-04 13:23       ` Stefan Hajnoczi
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-12-04 13:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Xu, Elena Afanasova, kvm, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, elena.ufimtseva, jag.raman

[-- Attachment #1: Type: text/plain, Size: 4009 bytes --]

On Thu, Dec 03, 2020 at 06:34:00AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:10:36AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Dec 02, 2020 at 01:06:28PM -0500, Peter Xu wrote:
> > > On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > > 
> > > [...]
> > > 
> > > > Wire protocol
> > > > -------------
> > > > The protocol spoken over the file descriptor is as follows. The device reads
> > > > commands from the file descriptor with the following layout::
> > > > 
> > > >   struct ioregionfd_cmd {
> > > >       __u32 info;
> > > >       __u32 padding;
> > > >       __u64 user_data;
> > > >       __u64 offset;
> > > >       __u64 data;
> > > >   };
> > > 
> > > I'm thinking whether it would be nice to have a handshake on the wire protocol
> > > before starting the cmd/resp sequence.
> > > 
> > > I was thinking about migration - we have had a hard time trying to be
> > > compatible between old/new qemus.  Now we fixed those by applying the same
> > > migration capabilities on both sides always so we do the handshake "manually"
> > > from libvirt, but it really should be done with a real handshake on the
> > > channel, imho..  That's another story, for sure.
> > > 
> > > My understanding is that the wire protocol is kind of a standalone (but tiny)
> > > protocol between kvm and the emulation process.  So I'm thinking the handshake
> > > could also help when e.g. kvm can fallback to an old version of wire protocol
> > > if it knows the emulation binary is old.  Ideally, I think this could even
> > > happen without VMM's awareness.
> > > 
> > > [...]
> > 
> > I imagined that would happen in the control plane (KVM ioctls) instead
> > of the data plane (the fd). There is a flags field in
> > ioctl(KVM_SET_IOREGION):
> > 
> >   struct kvm_ioregion {
> >       __u64 guest_paddr; /* guest physical address */
> >       __u64 memory_size; /* bytes */
> >       __u64 user_data;
> >       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
> >       __u32 flags;
> >       __u8  pad[32];
> >   };
> > 
> > When userspace sets up the ioregionfd it can tell the kernel which
> > features to enable.
> > 
> > Feature availability can be checked through ioctl(KVM_CHECK_EXTENSION).
> > 
> > Do you think this existing mechanism is enough? It's not clear to me
> > what kind of additional negotiation would be necessary between the
> > device emulation process and KVM after the ioregionfd has been
> > registered?
> > 
> > > > Ordering
> > > > --------
> > > > Guest accesses are delivered in order, including posted writes.
> > > 
> > > I'm wondering whether it should prepare for out-of-order commands, since
> > > without a handshake the protocol is harder to extend later - just in case
> > > there could be some slow commands and we still want a chance to reply to a
> > > very trivial command while handling the slow one (then each command may
> > > require a command ID, too).  But it won't be a problem at all if we can
> > > easily extend the wire protocol so the ordering constraint can be extended
> > > too when really needed, and we can always start with in-order-only requests.
> > 
> > Elena and I brainstormed out-of-order commands but didn't include them
> > in the proposal because it's not clear that they are needed. For
> > multi-queue devices the per-queue registers can be assigned different
> > ioregionfds that are handled by dedicated threads.
> 
> The difficulty is I think the reverse: reading
> any register from a PCI device is normally enough to flush any
> writes and interrupts in progress.

Accesses dispatched on the same ioregionfd are ordered. If a device
requires ordering then it needs to use one ioregionfd.

I'm not aware of issues with ioeventfd where the vhost worker thread or
QEMU's IOThread processes the doorbell while the vCPU thread processes
other virtio-pci register accesses. The situation is similar with
ioregionfd.
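
As a sketch, a multi-queue device would register one ioregionfd per queue
doorbell so that queues can be serviced concurrently while accesses on
each individual fd stay ordered (proposed ioctls; the offsets and names
are made up):

  for (int q = 0; q < num_queues; q++) {
      int qfd = ioctl(kvm_fd, KVM_CREATE_IOREGIONFD, 0);
      struct kvm_ioregion r = {
          .guest_paddr = bar0_gpa + notify_off + q * 4,
          .memory_size = 4,
          .user_data   = q,
          .fd          = qfd,
          .flags       = KVM_IOREGION_POSTED_WRITES, /* posted doorbells */
      };
      ioctl(vm_fd, KVM_SET_IOREGION, &r);
      /* each qfd is serviced by its own thread in the device process */
  }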

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-03 11:10   ` Stefan Hajnoczi
  2020-12-03 11:34     ` Michael S. Tsirkin
@ 2020-12-03 14:40     ` Peter Xu
  2020-12-07 14:58       ` Stefan Hajnoczi
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Xu @ 2020-12-03 14:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, elena.ufimtseva, jag.raman

On Thu, Dec 03, 2020 at 11:10:36AM +0000, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2020 at 01:06:28PM -0500, Peter Xu wrote:
> > On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > 
> > [...]
> > 
> > > Wire protocol
> > > -------------
> > > The protocol spoken over the file descriptor is as follows. The device reads
> > > commands from the file descriptor with the following layout::
> > > 
> > >   struct ioregionfd_cmd {
> > >       __u32 info;
> > >       __u32 padding;
> > >       __u64 user_data;
> > >       __u64 offset;
> > >       __u64 data;
> > >   };
> > 
> > I'm thinking whether it would be nice to have a handshake on the wire protocol
> > before starting the cmd/resp sequence.
> > 
> > I was thinking about migration - we have had a hard time trying to be
> > compatible between old/new qemus.  Now we fixed those by applying the same
> > migration capabilities on both sides always so we do the handshake "manually"
> > from libvirt, but it really should be done with a real handshake on the
> > channel, imho..  That's another story, for sure.
> > 
> > My understanding is that the wire protocol is kind of a standalone (but tiny)
> > protocol between kvm and the emulation process.  So I'm thinking the handshake
> > could also help when e.g. kvm can fallback to an old version of wire protocol
> > if it knows the emulation binary is old.  Ideally, I think this could even
> > happen without VMM's awareness.
> > 
> > [...]
> 
> I imagined that would happen in the control plane (KVM ioctls) instead
> of the data plane (the fd). There is a flags field in
> ioctl(KVM_SET_IOREGION):
> 
>   struct kvm_ioregion {
>       __u64 guest_paddr; /* guest physical address */
>       __u64 memory_size; /* bytes */
>       __u64 user_data;
>       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
>       __u32 flags;
>       __u8  pad[32];
>   };
> 
> When userspace sets up the ioregionfd it can tell the kernel which
> features to enable.
> 
> Feature availability can be checked through ioctl(KVM_CHECK_EXTENSION).
> 
> Do you think this existing mechanism is enough? It's not clear to me
> what kind of additional negotiation would be necessary between the
> device emulation process and KVM after the ioregionfd has been
> registered?

Yes I think kvm capability can be used as a way for the handshake between VMM
and KVM.  However I'm not sure how to do similar things to the emulation
process besides passing over the ioregionfd, because that's between VMM and the
emulation process.  Is there a way to do so with the current proposal?

The out-of-order feature may not be a good enough example if it's destined to
be useless... but let's just imagine we have such a requirement as an extension
to the current wire protocol.  What I was thinking was whether we could have
another handshake somehow to the emulation process, so that we can identify "ok
this emulation process supports out-of-band" or vice versa.  Only with that
information could VMM enable/disable the out-of-band in KVM.
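
(Purely as an illustration of what I mean, the handshake could be as
small as one fixed-size message exchanged over the fd before any
ioregionfd_cmd, e.g.:

  struct ioregionfd_hello {
      __u32 version;    /* highest wire-protocol version supported */
      __u32 features;   /* e.g. bit 0 = out-of-order commands      */
  };

KVM writes its ioregionfd_hello, the emulation process replies with what
it supports, and both sides use the intersection.  Nothing like this is
in the current proposal, it's just to make the idea concrete.)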

The handshake to the emulation process can start from either VMM or KVM.  And
my previous idea was simply to let KVM and the emulation process talk rather than VMM
against the emulation, because they really look like two isolated protocols:

  - The VMM <-> KVM side talks via KVM_SET_IOREGION; it is only responsible for
    setting up an MMIO/PIO region in the guest and creating the fd. As long as
    the region is properly set up and the fd is passed over to the emulation
    process, its job as a proxy layer is done.

  - The KVM <-> emulation process side talks via the wire protocol (so this
    protocol can be independent of the KVM_SET_IOREGION ioctl itself).  For
    example, the out-of-order feature changes the behavior of the wire
    protocol.  Ideally it does not even need a new KVM capability because the
    KVM_SET_IOREGION functionality shouldn't be touched.

I thought it would be cleaner to have these protocols separate, but I could
also be wrong or over-engineering.

Thanks,

-- 
Peter Xu



* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-03 11:10   ` Stefan Hajnoczi
@ 2020-12-03 11:34     ` Michael S. Tsirkin
  2020-12-04 13:23       ` Stefan Hajnoczi
  2020-12-03 14:40     ` Peter Xu
  1 sibling, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 11:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Xu, Elena Afanasova, kvm, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, elena.ufimtseva, jag.raman

On Thu, Dec 03, 2020 at 11:10:36AM +0000, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2020 at 01:06:28PM -0500, Peter Xu wrote:
> > On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> > 
> > [...]
> > 
> > > Wire protocol
> > > -------------
> > > The protocol spoken over the file descriptor is as follows. The device reads
> > > commands from the file descriptor with the following layout::
> > > 
> > >   struct ioregionfd_cmd {
> > >       __u32 info;
> > >       __u32 padding;
> > >       __u64 user_data;
> > >       __u64 offset;
> > >       __u64 data;
> > >   };
> > 
> > I'm thinking whether it would be nice to have a handshake on the wire protocol
> > before starting the cmd/resp sequence.
> > 
> > I was thinking about migration - we have had a hard time trying to be
> > compatible between old/new qemus.  Now we fixed those by applying the same
> > migration capabilities on both sides always so we do the handshake "manually"
> > from libvirt, but it really should be done with a real handshake on the
> > channel, imho..  That's another story, for sure.
> > 
> > My understanding is that the wire protocol is kind of a standalone (but tiny)
> > protocol between kvm and the emulation process.  So I'm thinking the handshake
> > could also help when e.g. kvm can fallback to an old version of wire protocol
> > if it knows the emulation binary is old.  Ideally, I think this could even
> > happen without VMM's awareness.
> > 
> > [...]
> 
> I imagined that would happen in the control plane (KVM ioctls) instead
> of the data plane (the fd). There is a flags field in
> ioctl(KVM_SET_IOREGION):
> 
>   struct kvm_ioregion {
>       __u64 guest_paddr; /* guest physical address */
>       __u64 memory_size; /* bytes */
>       __u64 user_data;
>       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
>       __u32 flags;
>       __u8  pad[32];
>   };
> 
> When userspace sets up the ioregionfd it can tell the kernel which
> features to enable.
> 
> Feature availability can be checked through ioctl(KVM_CHECK_EXTENSION).
> 
> Do you think this existing mechanism is enough? It's not clear to me
> what kind of additional negotiation would be necessary between the
> device emulation process and KVM after the ioregionfd has been
> registered?
> 
> > > Ordering
> > > --------
> > > Guest accesses are delivered in order, including posted writes.
> > 
> > I'm wondering whether it should prepare for out-of-order commands, assuming
> > that without a handshake the protocol is harder to extend, just in case there
> > are some slow commands and we still want a chance to reply to a trivial
> > command while handling the slow one (then each command may also require a
> > command ID).  But it won't be a problem at all if we can easily extend the
> > wire protocol so the ordering constraint can be relaxed when really needed,
> > and we can always start with in-order-only requests.
> 
> Elena and I brainstormed out-of-order commands but didn't include them
> in the proposal because it's not clear that they are needed. For
> multi-queue devices the per-queue registers can be assigned different
> ioregionfds that are handled by dedicated threads.

The difficulty, I think, is the reverse: reading
any register from a PCI device is normally enough to flush any
writes and interrupts in progress.



> Out-of-order commands are only necessary if a device needs to
> concurrently process register accesses to the *same* set of registers. I
> think it's rare for hardware register interfaces to be designed like
> that.
> 
> This could be a mistake, of course. If someone knows a device that needs
> multiple in-flight register accesses, please let us know.
> 
> Stefan




* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-12-02 18:06 ` Peter Xu
@ 2020-12-03 11:10   ` Stefan Hajnoczi
  2020-12-03 11:34     ` Michael S. Tsirkin
  2020-12-03 14:40     ` Peter Xu
  0 siblings, 2 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2020-12-03 11:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Afanasova, kvm, mst, john.g.johnson, dinechin, cohuck,
	jasowang, felipe, elena.ufimtseva, jag.raman

On Wed, Dec 02, 2020 at 01:06:28PM -0500, Peter Xu wrote:
> On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> 
> [...]
> 
> > Wire protocol
> > -------------
> > The protocol spoken over the file descriptor is as follows. The device reads
> > commands from the file descriptor with the following layout::
> > 
> >   struct ioregionfd_cmd {
> >       __u32 info;
> >       __u32 padding;
> >       __u64 user_data;
> >       __u64 offset;
> >       __u64 data;
> >   };
> 
> I'm thinking whether it would be nice to have a handshake on the wire protocol
> before starting the cmd/resp sequence.
> 
> I was thinking about migration - we have had a hard time trying to be
> compatible between old/new qemus.  Now we fixed those by applying the same
> migration capabilities on both sides always so we do the handshake "manually"
> from libvirt, but it really should be done with a real handshake on the
> channel, imho..  That's another story, for sure.
> 
> My understanding is that the wire protocol is kind of a standalone (but tiny)
> protocol between kvm and the emulation process.  So I'm thinking the handshake
> could also help when e.g. kvm can fallback to an old version of wire protocol
> if it knows the emulation binary is old.  Ideally, I think this could even
> happen without VMM's awareness.
> 
> [...]

I imagined that would happen in the control plane (KVM ioctls) instead
of the data plane (the fd). There is a flags field in
ioctl(KVM_SET_IOREGION):

  struct kvm_ioregion {
      __u64 guest_paddr; /* guest physical address */
      __u64 memory_size; /* bytes */
      __u64 user_data;
      __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
      __u32 flags;
      __u8  pad[32];
  };

When userspace sets up the ioregionfd it can tell the kernel which
features to enable.

Feature availability can be checked through ioctl(KVM_CHECK_EXTENSION).

Do you think this existing mechanism is enough? It's not clear to me
what kind of additional negotiation would be necessary between the
device emulation process and KVM after the ioregionfd has been
registered?
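
Purely as an illustrative sketch (assuming kernel headers that carry the
proposed KVM_CAP_IOREGIONFD definition), the VMM-side probe could be as
simple as:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>   /* assumed: proposed KVM_CAP_IOREGIONFD */

  /* Returns non-zero if the running kernel supports ioregionfd. */
  static int have_ioregionfd(int kvm_fd)
  {
      return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IOREGIONFD) > 0;
  }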

> > Ordering
> > --------
> > Guest accesses are delivered in order, including posted writes.
> 
> > I'm wondering whether it should prepare for out-of-order commands, assuming
> > that without a handshake the protocol is harder to extend, just in case there
> > are some slow commands and we still want a chance to reply to a trivial
> > command while handling the slow one (then each command may also require a
> > command ID).  But it won't be a problem at all if we can easily extend the
> > wire protocol so the ordering constraint can be relaxed when really needed,
> > and we can always start with in-order-only requests.

Elena and I brainstormed out-of-order commands but didn't include them
in the proposal because it's not clear that they are needed. For
multi-queue devices the per-queue registers can be assigned different
ioregionfds that are handled by dedicated threads.

Out-of-order commands are only necessary if a device needs to
concurrently process register accesses to the *same* set of registers. I
think it's rare for hardware register interfaces to be designed like
that.

This could be a mistake, of course. If someone knows a device that needs
multiple in-flight register accesses, please let us know.

Stefan



* Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
  2020-11-25 20:44 Elena Afanasova
@ 2020-12-02 18:06 ` Peter Xu
  2020-12-03 11:10   ` Stefan Hajnoczi
  2021-10-12  5:34   ` elena
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Xu @ 2020-12-02 18:06 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, mst, john.g.johnson, dinechin, cohuck, jasowang, felipe,
	stefanha, elena.ufimtseva, jag.raman

On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:

[...]

> Wire protocol
> -------------
> The protocol spoken over the file descriptor is as follows. The device reads
> commands from the file descriptor with the following layout::
> 
>   struct ioregionfd_cmd {
>       __u32 info;
>       __u32 padding;
>       __u64 user_data;
>       __u64 offset;
>       __u64 data;
>   };

I'm thinking whether it would be nice to have a handshake on the wire protocol
before starting the cmd/resp sequence.

I was thinking about migration - we have had a hard time trying to be
compatible between old/new qemus.  Now we fixed those by applying the same
migration capabilities on both sides always so we do the handshake "manually"
from libvirt, but it really should be done with a real handshake on the
channel, imho..  That's another story, for sure.

My understanding is that the wire protocol is kind of a standalone (but tiny)
protocol between kvm and the emulation process.  So I'm thinking the handshake
could also help when e.g. kvm can fallback to an old version of wire protocol
if it knows the emulation binary is old.  Ideally, I think this could even
happen without VMM's awareness.

[...]

> Ordering
> --------
> Guest accesses are delivered in order, including posted writes.

I'm wondering whether it should prepare for out-of-order commands, assuming
that without a handshake the protocol is harder to extend, just in case there
are some slow commands and we still want a chance to reply to a trivial
command while handling the slow one (then each command may also require a
command ID).  But it won't be a problem at all if we can easily extend the
wire protocol so the ordering constraint can be relaxed when really needed,
and we can always start with in-order-only requests.

Thanks,

-- 
Peter Xu



* MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
@ 2020-11-25 20:44 Elena Afanasova
  2020-12-02 18:06 ` Peter Xu
  2021-10-12  5:34   ` elena
  0 siblings, 2 replies; 33+ messages in thread
From: Elena Afanasova @ 2020-11-25 20:44 UTC (permalink / raw)
  To: kvm
  Cc: mst, john.g.johnson, dinechin, cohuck, jasowang, felipe,
	stefanha, elena.ufimtseva, jag.raman, eafanasova

Hello,

I'm an Outreachy intern with QEMU and I’m working on implementing the ioregionfd 
API in KVM. So I’d like to resume the ioregionfd design discussion. The latest 
version of the ioregionfd API document is provided below.

Overview
--------
ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses over a
file descriptor without returning from ioctl(KVM_RUN). This allows device
emulation to run in another task separate from the vCPU task.

This is achieved through KVM ioctls for registering MMIO/PIO regions and a wire
protocol that KVM uses to communicate with a task handling an MMIO/PIO access.

The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation in a
separate task looks like this:

   kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task

ioregionfd improves performance by eliminating the need for the vCPU task to
forward MMIO/PIO exits to device emulation tasks:

   kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task
     ^
     `---ioregionfd---> device task

Both multi-threaded and multi-process VMMs can take advantage of ioregionfd to
run device emulation in dedicated threads and processes, respectively.

This mechanism is similar to ioeventfd except it supports all read and write
accesses, whereas ioeventfd only supports posted doorbell writes.

Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work alongside
the new mechanism, but only one mechanism handles a MMIO/PIO access.

KVM_CREATE_IOREGIONFD
---------------------
:Capability: KVM_CAP_IOREGIONFD
:Architectures: all
:Type: system ioctl
:Parameters: none
:Returns: an ioregionfd file descriptor, -1 on error

This ioctl creates a new ioregionfd and returns the file descriptor. The fd can
be used to handle MMIO/PIO accesses instead of returning from ioctl(KVM_RUN)
with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must be
registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses on the
fd. An ioregionfd can be used with multiple VMs and its lifecycle is not tied
to a specific VM.

When the last file descriptor for an ioregionfd is closed, all regions
registered with KVM_SET_IOREGION are dropped and guest accesses to those
regions cause ioctl(KVM_RUN) to return again.
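
For illustration only, a minimal usage sketch; since KVM_CREATE_IOREGIONFD is
part of this proposal, the definitions below are assumed to come from
correspondingly patched kernel headers::

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>      /* assumed: proposed KVM_CREATE_IOREGIONFD */

  int main(void)
  {
      int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
      if (kvm_fd < 0) {
          perror("open /dev/kvm");
          return 1;
      }

      /* System ioctl with no parameters; returns the new ioregionfd. */
      int region_fd = ioctl(kvm_fd, KVM_CREATE_IOREGIONFD, 0);
      if (region_fd < 0) {
          perror("KVM_CREATE_IOREGIONFD");
          return 1;
      }

      /* region_fd can now be registered with KVM_SET_IOREGION (below)
       * and passed to a device emulation thread or process. */
      return 0;
  }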

KVM_SET_IOREGION
----------------
:Capability: KVM_CAP_IOREGIONFD
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_ioregion (in)
:Returns: 0 on success, -1 on error

This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO region. Guest
read and write accesses are dispatched through the given ioregionfd instead of
returning from ioctl(KVM_RUN).

::

  struct kvm_ioregion {
      __u64 guest_paddr; /* guest physical address */
      __u64 memory_size; /* bytes */
      __u64 user_data;
      __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
      __u32 flags;
      __u8  pad[32];
  };

  /* for kvm_ioregion::flags */
  #define KVM_IOREGION_PIO           (1u << 0)
  #define KVM_IOREGION_POSTED_WRITES (1u << 1)

If a new region would split an existing region, -1 is returned and errno is
EINVAL.

Regions can be deleted by setting fd to -1. If no existing region matches
guest_paddr and memory_size then -1 is returned and errno is ENOENT.

Existing regions can be modified as long as guest_paddr and memory_size
match an existing region.

MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead.

The user_data value is included in messages KVM writes to the ioregionfd upon
guest access. KVM does not interpret user_data.

Both read and write guest accesses wait for a response before entering the
guest again. With the KVM_IOREGION_POSTED_WRITES flag, writes do not wait for a
response and the guest is entered again immediately. This is suitable for
accesses that do not require synchronous emulation, such as posted doorbell
register writes.
Note that guest writes may block the vCPU despite KVM_IOREGION_POSTED_WRITES if
the device is too slow in reading from the ioregionfd.
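
A hypothetical helper for registering (or, with fd set to -1, deleting) a
region might look like this sketch, again assuming headers that carry the
proposed definitions::

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>   /* assumed: struct kvm_ioregion, KVM_SET_IOREGION */

  static int set_ioregion(int vm_fd, int region_fd, __u64 gpa, __u64 size,
                          __u64 user_data, __u32 flags)
  {
      struct kvm_ioregion ioregion;

      memset(&ioregion, 0, sizeof(ioregion));
      ioregion.guest_paddr = gpa;
      ioregion.memory_size = size;
      ioregion.user_data   = user_data;  /* echoed back in each ioregionfd_cmd */
      ioregion.fd          = region_fd;  /* from KVM_CREATE_IOREGIONFD, or -1 to delete */
      ioregion.flags       = flags;      /* 0, KVM_IOREGION_PIO, KVM_IOREGION_POSTED_WRITES */

      return ioctl(vm_fd, KVM_SET_IOREGION, &ioregion);
  }

Deletion then reuses the same guest_paddr/memory_size pair with region_fd set
to -1, matching the rules above.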

Wire protocol
-------------
The protocol spoken over the file descriptor is as follows. The device reads
commands from the file descriptor with the following layout::

  struct ioregionfd_cmd {
      __u32 info;
      __u32 padding;
      __u64 user_data;
      __u64 offset;
      __u64 data;
  };

The info field layout is as follows::

  bits:  | 31 ... 8 |  6   | 5 ... 4 | 3 ... 0 |
  field: | reserved | resp |   size  |   cmd   |

The cmd field identifies the operation to perform::

  #define IOREGIONFD_CMD_READ  0
  #define IOREGIONFD_CMD_WRITE 1

The size field indicates the size of the access::

  #define IOREGIONFD_SIZE_8BIT  0
  #define IOREGIONFD_SIZE_16BIT 1
  #define IOREGIONFD_SIZE_32BIT 2
  #define IOREGIONFD_SIZE_64BIT 3

If the command is IOREGIONFD_CMD_WRITE then the resp bit indicates whether or
not a response must be sent.

The user_data field contains the opaque value provided to KVM_SET_IOREGION.
Applications can use this to uniquely identify the region that is being
accessed.

The offset field contains the byte offset being accessed within a region
that was registered with KVM_SET_IOREGION.

If the command is IOREGIONFD_CMD_WRITE then data contains the value
being written. The data value is a 64-bit integer in host endianness,
regardless of the access size.

The device sends responses by writing the following structure to the
file descriptor::

  struct ioregionfd_resp {
      __u64 data;
      __u8 pad[24];
  };

The data field contains the value read by an IOREGIONFD_CMD_READ
command. This field is zero for other commands. The data value is a 64-bit
integer in host endianness, regardless of the access size.
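
For illustration, a single-threaded device loop might decode commands as in the
sketch below. It uses the structures and IOREGIONFD_CMD_* constants above; the
decode masks follow the info layout above, the 8-byte register file is an
assumption, and the access size is ignored for brevity::

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  /* struct ioregionfd_cmd / ioregionfd_resp and IOREGIONFD_CMD_* as above */

  #define IOREGIONFD_CMD_MASK   0xfu        /* info bits 3..0                  */
  #define IOREGIONFD_SIZE_SHIFT 4           /* info bits 5..4: IOREGIONFD_SIZE_* */
  #define IOREGIONFD_SIZE_MASK  0x3u        /* (unused in this sketch)         */
  #define IOREGIONFD_RESP       (1u << 6)   /* info resp bit                   */

  static uint64_t regs[16];                 /* hypothetical register file */

  static void serve(int region_fd)
  {
      struct ioregionfd_cmd cmd;
      struct ioregionfd_resp resp;

      while (read(region_fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
          uint32_t op  = cmd.info & IOREGIONFD_CMD_MASK;
          uint64_t idx = (cmd.offset / 8) % 16;

          memset(&resp, 0, sizeof(resp));
          if (op == IOREGIONFD_CMD_WRITE) {
              regs[idx] = cmd.data;
              if (!(cmd.info & IOREGIONFD_RESP))
                  continue;                 /* posted write: no response */
          } else {
              resp.data = regs[idx];        /* IOREGIONFD_CMD_READ */
          }
          write(region_fd, &resp, sizeof(resp));
      }
  }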

Ordering
--------
Guest accesses are delivered in order, including posted writes.

Signals
-------
The vCPU task can be interrupted by a signal while waiting for an ioregionfd
response. In this case ioctl(KVM_RUN) returns with -EINTR. Guest entry is
deferred until ioctl(KVM_RUN) is called again and the response has been written
to the ioregionfd.
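
A vCPU thread therefore typically retries, as in this sketch::

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void run_vcpu(int vcpu_fd)
  {
      for (;;) {
          if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
              if (errno == EINTR)
                  continue;   /* guest entry is deferred until the ioregionfd
                                 response has been written */
              break;          /* real error: report it */
          }
          /* handle struct kvm_run exit reasons as usual */
      }
  }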

Security
--------
Device emulation processes may be untrusted in multi-process VMM architectures.
Therefore the control plane and the data plane of ioregionfd are separate. A
task that only has access to an ioregionfd is unable to add/modify/remove
regions since that requires ioctls on a KVM vm fd. This ensures that device
emulation processes can only service MMIO/PIO accesses for regions that the VMM
registered on their behalf.

Multi-queue scalability
-----------------------
The protocol is synchronous - only one command/response cycle is in flight at a
time - but the vCPU will be blocked until the response has been processed
anyway. If another vCPU accesses an MMIO or PIO region belonging to the same
ioregionfd during this time then it waits for the first access to complete.

Per-queue ioregionfds can be set up to take advantage of concurrency on
multi-queue devices.
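
As a sketch of such a setup (building on the set_ioregion() helper and serve()
loop sketched earlier; NUM_QUEUES and the doorbell layout are assumptions)::

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  #define NUM_QUEUES      4
  #define DOORBELL_BASE   0xfe000000ULL   /* assumed device register layout */
  #define DOORBELL_STRIDE 0x1000ULL

  static void setup_queue_ioregions(int kvm_fd, int vm_fd,
                                    int queue_fds[NUM_QUEUES])
  {
      for (int q = 0; q < NUM_QUEUES; q++) {
          /* One ioregionfd per queue so accesses are handled concurrently. */
          queue_fds[q] = ioctl(kvm_fd, KVM_CREATE_IOREGIONFD, 0);
          set_ioregion(vm_fd, queue_fds[q],
                       DOORBELL_BASE + q * DOORBELL_STRIDE, DOORBELL_STRIDE,
                       q /* user_data identifies the queue */,
                       KVM_IOREGION_POSTED_WRITES);
          /* a dedicated thread per queue then runs serve(queue_fds[q]) */
      }
  }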

Polling
-------
Userspace can poll ioregionfd by submitting an io_uring IORING_OP_READ request
and polling the cq ring to detect when the read has completed. Although this
dispatch mechanism incurs more overhead than polling directly on guest RAM, it
captures each write access and supports reads.
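
A rough liburing-based sketch of this approach (busy-polling the cq ring;
error handling omitted)::

  #include <liburing.h>
  /* struct ioregionfd_cmd as defined in the wire protocol section */

  static void poll_region(int region_fd)
  {
      struct io_uring ring;
      struct ioregionfd_cmd cmd;

      io_uring_queue_init(8, &ring, 0);

      for (;;) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          struct io_uring_cqe *cqe;

          io_uring_prep_read(sqe, region_fd, &cmd, sizeof(cmd), 0);
          io_uring_submit(&ring);

          /* Poll the completion ring from userspace. */
          while (io_uring_peek_cqe(&ring, &cqe) == -EAGAIN)
              ;                   /* spin; a real implementation would bound this */
          io_uring_cqe_seen(&ring, cqe);

          /* decode cmd and send the response as in the earlier device loop */
      }
  }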

Does it obsolete ioeventfd?
---------------------------
No, although KVM_IOREGION_POSTED_WRITES offers somewhat similar functionality
to ioeventfd, there are differences. The datamatch functionality of ioeventfd
is not available and would need to be implemented by the device emulation
program. Due to the counter semantics of eventfds there is automatic coalescing
of repeated accesses with ioeventfd. Overall ioeventfd is lighter weight but
also more limited.



end of thread, other threads:[~2021-10-28  8:16 UTC | newest]

Thread overview: 33+ messages
     [not found] <CAFO2pHzmVf7g3z0RikQbYnejwcWRtHKV=npALs49eRDJdt4mJQ@mail.gmail.com>
2020-11-26  3:37 ` MMIO/PIO dispatch file descriptors (ioregionfd) design discussion Jason Wang
2020-11-26 12:36   ` Stefan Hajnoczi
2020-11-27  3:39     ` Jason Wang
2020-11-27 13:44       ` Stefan Hajnoczi
2020-11-30  2:14         ` Jason Wang
2020-11-30 12:47           ` Stefan Hajnoczi
2020-12-01  4:05             ` Jason Wang
2020-12-01 10:35               ` Stefan Hajnoczi
2020-12-02  2:53                 ` Jason Wang
2020-12-02 14:17                 ` Elena Afanasova
2020-11-25 20:44 Elena Afanasova
2020-12-02 18:06 ` Peter Xu
2020-12-03 11:10   ` Stefan Hajnoczi
2020-12-03 11:34     ` Michael S. Tsirkin
2020-12-04 13:23       ` Stefan Hajnoczi
2020-12-03 14:40     ` Peter Xu
2020-12-07 14:58       ` Stefan Hajnoczi
2021-10-12  5:34 ` elena
2021-10-25 12:42   ` Stefan Hajnoczi
2021-10-25 15:21     ` Elena
2021-10-25 16:56       ` Stefan Hajnoczi
2021-10-26 19:01       ` John Levon
2021-10-27 10:15         ` Stefan Hajnoczi
2021-10-27 12:22           ` John Levon
2021-10-28  8:14             ` Stefan Hajnoczi
