On Thu, Jul 16, 2020 at 04:34:04PM +0200, Christophe de Dinechin wrote:
> 
> 
> > On 16 Jul 2020, at 16:19, Stefan Hajnoczi wrote:
> > 
> > On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi wrote:
> >>> 
> >>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> >>>> 
> >>>> Stefan Hajnoczi writes:
> >>>> 
> >>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >>>>>> Stefan Hajnoczi writes:
> >>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >>>>>>>> Finally I'm curious if this is just a problem avoided by the s390
> >>>>>>>> channel approach? Does the use of messages over a channel just avoid
> >>>>>>>> the sort of bouncing back and forth that other hypervisors have to do
> >>>>>>>> when emulating a device?
> >>>>>>> 
> >>>>>>> What does "bouncing back and forth" mean exactly?
> >>>>>> 
> >>>>>> Context switching between guest and hypervisor.
> >>>>> 
> >>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> >>>>> request on s390 channel I/O.
> >>>> 
> >>>> Thanks.
> >>>> 
> >>>> I was also wondering about the efficiency of doorbells/notifications the
> >>>> other way. AFAIUI for both PCI and MMIO only a single write to the
> >>>> notify flag is required, which causes a trap to the hypervisor and the
> >>>> rest of the processing. The hypervisor doesn't pay the cost of multiple
> >>>> exits to read the guest state, although it obviously wants to be as
> >>>> efficient as possible in passing the data back up to whatever is
> >>>> handling the backend of the device, so it doesn't need to do multiple
> >>>> context switches.
> >>>> 
> >>>> Has there been any investigation into other mechanisms for notifying the
> >>>> hypervisor of an event - for example using a HYP call or similar
> >>>> mechanism?
> >>>> 
> >>>> My gut tells me this probably doesn't make any difference, as a trap
> >>>> to the hypervisor is likely to cost the same either way because you
> >>>> still need to save the guest context before actioning something, but
> >>>> it would be interesting to know if anyone has looked at it. Perhaps
> >>>> there is a benefit in partitioned systems where the core running the
> >>>> guest can return straight away after initiating what it needs to
> >>>> internally in the hypervisor to pass the notification to something
> >>>> that can deal with it?
> >>> 
> >>> It's very architecture-specific. This is something Michael Tsirkin
> >>> looked into in the past. He found that MMIO and PIO perform differently
> >>> on x86. VIRTIO supports both so the device can be configured optimally.
> >>> There was an old discussion from 2013 here:
> >>> https://lkml.org/lkml/2013/4/4/299
> >>> 
> >>> Without nested page tables MMIO was slower than PIO. But with nested
> >>> page tables it was faster.
> >>> 
> >>> Another option on x86 is using Model-Specific Registers (for hypercalls)
> >>> but this doesn't fit into the PCI device model.
> >> 
> >> (Warning: what I write below is based on experience with very different
> >> architectures, both CPU and hypervisor; your mileage may vary.)
> >> 
> >> It looks to me like the discussion so far is mostly focused on a
> >> "synchronous" model where presumably the same CPU is switching context
> >> between guest and (host) device emulation.
> >> 
> >> However, I/O devices on real hardware are asynchronous by construction.
> >> They do their thing while the CPU processes stuff. So at least
> >> theoretically, there is no reason to context switch on the same CPU.
> >> You could very well have an I/O thread on some other CPU doing its
> >> thing. This allows one to do something some of you may have heard me
> >> talk about, called "interrupt coalescing".
> >> 
> >> As Stefan noted, this is not always a win, as it may introduce latency.
> >> There are at least two cases where this latency really hurts:
> >> 
> >> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
> >> was not active recently. Everything from cache to TLB may hit you here,
> >> but that normally happens when there isn't much I/O activity, so this
> >> case in practice does not hurt that much, or rather it hurts in a case
> >> where we don't really care.
> >> 
> >> 2. When the I/O thread is preempted, or not given enough cycles to do
> >> its stuff. This happens when the system is both CPU and I/O bound, and
> >> addressing that is mostly a scheduling issue. A CPU thread could hand
> >> off to a specific I/O thread, reducing that case to the kind of context
> >> switch Alex was mentioning, but I'm not sure how feasible it is to
> >> implement that on Linux / kvm.
> >> 
> >> In such cases, you have to pay for the context switch. I'm not sure if
> >> that context switch is markedly more expensive than a "vmexit". On at
> >> least that alien architecture I was familiar with, there was little
> >> difference between switching to "your" host CPU thread and switching
> >> to "another" host I/O thread. But then the context switch was all in
> >> software, so we had designed it that way.
> >> 
> >> So let's assume now that you run your device emulation fully in an I/O
> >> thread, which we will assume for simplicity sits mostly in host user
> >> space, and your guest I/O code runs in a CPU thread, which we will
> >> assume sits mostly in guest user/kernel space.
> >> 
> >> It is possible to share two-way doorbells / IRQ queues on some memory
> >> page, very similar to a virtqueue. When you want to "doorbell" your
> >> device, you simply write to that page. The device thread picks it up
> >> by reading the same page, and posts I/O completions on the same page,
> >> with simple memory writes.
> >> 
> >> Consider this I/O exchange buffer as having (at least) a writer and
> >> reader index for both doorbells and virtual interrupts.
> >> In the explanation below, I will call them "dwi", "dri", "iwi", "iri"
> >> for doorbell / interrupt read and write index. (Note that as a key
> >> optimization, you really don't want dwi and dri to be in the same
> >> cache line, since different CPUs are going to read and write them.)
> >> 
> >> You obviously still need to "kick" the I/O or CPU thread, and we are
> >> talking about an IPI here since you don't know which CPU that other
> >> thread is sitting on. But the interesting property is that you only
> >> need to do that when dwi==dri or iwi==iri, because if not, the other
> >> side has already been "kicked" and will keep working, i.e.
> >> incrementing dri or iri, until it reaches that state again.
> >> 
> >> The real "interrupt coalescing" trick can happen here. In some
> >> cases, you can decide to update your dwi or iwi without kicking,
> >> as long as you know that you will need to kick later. That requires
> >> some heavy cooperation from guest drivers, though, and is a
> >> second-order optimization.
> >> 
> >> With a scheme like this, you replace a systematic context switch
> >> for each device interrupt with a memory write and a "fire and forget"
> >> kick IPI that only happens when the system is not already busy
> >> processing I/Os, so that it can be eliminated when the system is
> >> most busy. With interrupt coalescing, you can send IPIs at a rate
> >> much lower than the actual I/O rate.
> >> 
> >> Not sure how difficult it is to adapt a scheme like this to the current
> >> state of qemu / kvm, but I'm pretty sure it works well if you implement
> >> it correctly ;-)
> >> 
> >>> 
> >>> A bigger issue than vmexit latency is device emulation thread wakeup
> >>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring
> >>> the ioeventfd but it may be descheduled. Its physical CPU may be in a
> >>> low power state.
> >>> I ran a benchmark late last year with QEMU's AioContext
> >>> adaptive polling disabled so we can measure the wakeup latency:
> >>> 
> >>>   CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio:
> >>>       fast mmio at gpa 0xfde03000
> >>>   IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
> >>>                    4 microseconds ------^
> 
> > Hi Christophe,
> > QEMU/KVM does something similar to what you described. In the perf
> > output above the vmexit kvm_fast_mmio event occurs on physical CPU
> > "[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> > resumes guest execution immediately after marking the ioeventfd ready.
> > There is no context switch to the IOThread or a return from
> > ioctl(KVM_RUN) on CPU#0.
> 
> Oh, that's good.
> 
> But then the conclusion that the 4us delay limits us to 250k IOPS
> is incorrect, no? Is there anything that would prevent multiple
> I/O events (doorbell or interrupt) from being in flight at the same time?

The number I posted is a worst-case latency scenario:

1. Queue depth = 1. No batching. Workloads that want to achieve maximum
   IOPS typically queue more than 1 request at a time. When multiple
   requests are queued, both submission and completion can be batched so
   that only a fraction of the vmexits + interrupts are required to
   process N requests.

2. QEMU AioContext adaptive polling disabled. Polling skips ioeventfd
   and instead polls on the virtqueue ring index memory that the guest
   updates. Polling is on by default when -device virtio-blk-pci,iothread=
   is used, but I disabled it for this test.

But if we stick with the worst-case scenario then we really are limited
to 250k IOPS per virtio-blk device, because a single IOThread is
processing the virtqueue with a 4 microsecond wakeup latency.

Stefan