On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
> 
> On 2020/11/26 8:36 PM, Stefan Hajnoczi wrote:
> > On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
> > > On 2020/11/26 3:21 AM, Elena Afanasova wrote:
> > > > Hello,
> > > >
> > > > I'm an Outreachy intern with QEMU and I'm working on implementing the
> > > > ioregionfd API in KVM.
> > > > So I'd like to resume the ioregionfd design discussion. The latest
> > > > version of the ioregionfd API document is provided below.
> > > >
> > > > Overview
> > > > --------
> > > > ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
> > > > over a
> > > > file descriptor without returning from ioctl(KVM_RUN). This allows device
> > > > emulation to run in another task separate from the vCPU task.
> > > >
> > > > This is achieved through KVM ioctls for registering MMIO/PIO regions and
> > > > a wire
> > > > protocol that KVM uses to communicate with a task handling an MMIO/PIO
> > > > access.
> > > >
> > > > The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation
> > > > in a
> > > > separate task looks like this:
> > > >
> > > >    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device
> > > > task
> > > >
> > > > ioregionfd improves performance by eliminating the need for the vCPU
> > > > task to
> > > > forward MMIO/PIO exits to device emulation tasks:
> > > I wonder at which cases we care performance like this. (Note that vhost-user
> > > supports set|get_config() for a while).
> > NVMe emulation needs this because ioeventfd cannot transfer the value
> > written to the doorbell. That's why QEMU's NVMe emulation doesn't
> > support IOThreads.
> 
> I think it depends on how many different value that can be carried via
> doorbell. If it's not tons of, we can use datamatch. Anyway virtio support
> differing queue index via the value wrote to doorbell.

There are too many values; it's not the queue index, it's the ring index of
the latest request. If the ring size is 128, we need 128 ioeventfd
registrations, etc. It becomes a lot.

By the way, the long-term use case for ioregionfd is to allow vfio-user
device emulation processes to directly handle I/O accesses. Elena benchmarked
ioeventfd vs dispatching through QEMU and can share the performance results.
I think the number was around a 30+% improvement via direct ioeventfd
dispatch, so it will be important for high-IOPS devices (network and storage
controllers).

> > 
> > > > KVM_CREATE_IOREGIONFD
> > > > ---------------------
> > > > :Capability: KVM_CAP_IOREGIONFD
> > > > :Architectures: all
> > > > :Type: system ioctl
> > > > :Parameters: none
> > > > :Returns: an ioregionfd file descriptor, -1 on error
> > > >
> > > > This ioctl creates a new ioregionfd and returns the file descriptor. The
> > > > fd can
> > > > be used to handle MMIO/PIO accesses instead of returning from
> > > > ioctl(KVM_RUN)
> > > > with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must
> > > > be
> > > > registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses
> > > > on the
> > > > fd. An ioregionfd can be used with multiple VMs and its lifecycle is not
> > > > tied
> > > > to a specific VM.
> > > >
> > > > When the last file descriptor for an ioregionfd is closed, all regions
> > > > registered with KVM_SET_IOREGION are dropped and guest accesses to those
> > > > regions cause ioctl(KVM_RUN) to return again.
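
To make the usage above concrete, the userspace flow would look roughly like
this (a sketch only; KVM_CAP_IOREGIONFD and KVM_CREATE_IOREGIONFD are the
proposed names from the document and are not in <linux/kvm.h> yet):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/kvm.h>     /* assumes the proposed constants are merged */

    int create_ioregionfd(void)
    {
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm < 0)
            return -1;

        /* System ioctl: issued on /dev/kvm rather than on a VM fd, since an
         * ioregionfd is not tied to a specific VM. */
        if (ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_IOREGIONFD) <= 0) {
            close(kvm);
            return -1;          /* kernel without ioregionfd support */
        }

        /* Takes no parameters; returns the new ioregionfd on success. */
        int ioregionfd = ioctl(kvm, KVM_CREATE_IOREGIONFD);

        /* MMIO/PIO regions would then be registered on it with
         * KVM_SET_IOREGION (not shown here). */
        close(kvm);
        return ioregionfd;
    }
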
> > > I may miss something, but I don't see any special requirement of this fd.
> > > The fd is just a transport of a protocol between KVM and userspace process. So
> > > instead of mandating a new type, it might be better to allow any type of fd
> > > to be attached. (E.g. pipe or socket).
> > pipe(2) is unidirectional on Linux, so it won't work.
> 
> 
> Can we accept two file descriptors to make it work?
> 
> 
> > 
> > mkfifo(3) seems usable but creates a node on a filesystem.
> > 
> > socketpair(2) would work, but brings in the network stack when it's not
> > needed. The advantage is that some future use case might want to direct
> > ioregionfd over a real socket to a remote host, which would be cool.
> > 
> > Do you have an idea of the performance difference of socketpair(2)
> > compared to a custom fd?
> 
> 
> It should be slower than custom fd and UNIX socket should be faster than
> TIPC. Maybe we can have a custom fd, but it's better to leave the policy to
> the userspace:
> 
> 1) KVM should not have any limitation of the fd it uses, user will risk
> itself if the fd has been used wrongly, and the custom fd should be one of
> the choice
> 2) it's better to not have a virt specific name (e.g "KVM" or "ioregion")

Okay, it looks like there are things to investigate here.

Elena: My suggestion would be to start with the simplest option: letting
userspace pass in 1 file descriptor. You can investigate the performance of
socketpair(2)/fifo(7), 2 pipe fds, or a custom file implementation later if
time permits. That way the API has maximum flexibility (userspace can decide
on the file type).

> Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
> allow it to decide how to proceed?

The eBPF program approach is interesting, but it would probably require
access to guest RAM and additional userspace state (e.g. device-specific
register values). I don't know the current status of Linux eBPF - is it
possible to access user memory (it could be swapped out)?

Stefan
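
P.S. To make the "pass in 1 file descriptor" suggestion concrete, here is a
rough sketch with socketpair(2). The message layout is a placeholder, not the
actual wire protocol, and the KVM_SET_IOREGION arguments are omitted since
they are not part of this excerpt:

    #include <stdint.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Placeholder layout, not the real ioregionfd wire protocol. */
    struct ioregion_msg {
        uint8_t  is_write;   /* 1 = write access, 0 = read access */
        uint8_t  size;       /* access size in bytes */
        uint64_t offset;     /* offset into the registered region */
        uint64_t data;       /* value for writes, unused for reads */
    };

    /* Device emulation task: services guest MMIO/PIO accesses on its end
     * of the socketpair instead of waiting for ioctl(KVM_RUN) to return. */
    static void device_task(int fd)
    {
        struct ioregion_msg msg;

        while (read(fd, &msg, sizeof(msg)) == sizeof(msg)) {
            /* Decode the access, emulate the device register, and for
             * reads send a completion back on the same fd. */
        }
    }

    int main(void)
    {
        int fds[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
            return 1;

        /* fds[0] is the end that would be handed to KVM when registering a
         * region with KVM_SET_IOREGION; fds[1] stays with the device
         * emulation task. */
        device_task(fds[1]);
        return 0;
    }

With 2 pipe fds the device task would instead read accesses from one pipe and
write completions to the other; the rest of the structure stays the same.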