> On 16 Jul 2020, at 16:19, Stefan Hajnoczi wrote:
>
> On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
>>
>>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi wrote:
>>>
>>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
>>>>
>>>> Stefan Hajnoczi writes:
>>>>
>>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>>>>>> Stefan Hajnoczi writes:
>>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>>>>>>>> Finally, I'm curious whether this is just a problem avoided by the
>>>>>>>> s390 channel approach. Does the use of messages over a channel
>>>>>>>> simply avoid the sort of bouncing back and forth that other
>>>>>>>> hypervisors have to do when emulating a device?
>>>>>>>
>>>>>>> What does "bouncing back and forth" mean exactly?
>>>>>>
>>>>>> Context switching between guest and hypervisor.
>>>>>
>>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
>>>>> request on s390 channel I/O.
>>>>
>>>> Thanks.
>>>>
>>>> I was also wondering about the efficiency of doorbells/notifications
>>>> the other way. AFAIUI, for both PCI and MMIO only a single write to
>>>> the notify flag is required, which causes a trap to the hypervisor
>>>> and the rest of the processing. The hypervisor doesn't pay the cost
>>>> of multiple exits to read the guest state, although it obviously
>>>> wants to be as efficient as possible in passing the data up to
>>>> whatever is handling the backend of the device, so it doesn't need
>>>> to do multiple context switches.
>>>>
>>>> Has there been any investigation into other mechanisms for notifying
>>>> the hypervisor of an event - for example using a HYP call or similar
>>>> mechanism?
>>>>
>>>> My gut tells me this probably doesn't make any difference, as a trap
>>>> to the hypervisor is likely to cost the same either way because you
>>>> still need to save the guest context before actioning something, but
>>>> it would be interesting to know if anyone has looked at it. Perhaps
>>>> there is a benefit in partitioned systems, where the core running
>>>> the guest can return straight away after initiating what it needs to
>>>> internally in the hypervisor to pass the notification to something
>>>> that can deal with it?
>>>
>>> It's very architecture-specific. This is something Michael Tsirkin
>>> looked into in the past. He found that MMIO and PIO perform
>>> differently on x86. VIRTIO supports both so the device can be
>>> configured optimally. There was an old discussion from 2013 here:
>>> https://lkml.org/lkml/2013/4/4/299
>>>
>>> Without nested page tables MMIO was slower than PIO. But with nested
>>> page tables it was faster.
>>>
>>> Another option on x86 is using Model-Specific Registers (for
>>> hypercalls), but this doesn't fit into the PCI device model.
>>
>> (Warning: what I write below is based on experience with very
>> different architectures, both CPU and hypervisor; your mileage may
>> vary.)
>>
>> It looks to me like the discussion so far is mostly focused on a
>> "synchronous" model where presumably the same CPU is switching context
>> between guest and (host) device emulation.
>>
>> However, I/O devices on real hardware are asynchronous by
>> construction. They do their thing while the CPU processes stuff. So at
>> least theoretically, there is no reason to context switch on the same
>> CPU. You could very well have an I/O thread on some other CPU doing
>> its thing. This makes it possible to do something some of you may have
>> heard me talk about, called "interrupt coalescing".
>>
>> As Stefan noted, this is not always a win, as it may introduce
>> latency. There are at least two cases where this latency really hurts:
>>
>> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
>> was not active recently. Everything from cache to TLB may hit you
>> here, but that normally happens when there isn't much I/O activity, so
>> in practice this case does not hurt that much, or rather it hurts in a
>> case where we don't really care.
>>
>> 2. When the I/O thread is preempted, or not given enough cycles to do
>> its stuff. This happens when the system is both CPU and I/O bound, and
>> addressing that is mostly a scheduling issue. A CPU thread could hand
>> off to a specific I/O thread, reducing that case to the kind of
>> context switch Alex was mentioning, but I'm not sure how feasible it
>> is to implement that on Linux / kvm.
>>
>> In such cases, you have to pay for a context switch. I'm not sure that
>> context switch is markedly more expensive than a "vmexit". On at least
>> that alien architecture I was familiar with, there was little
>> difference between switching to "your" host CPU thread and switching
>> to "another" host I/O thread. But then the context switch was all in
>> software, so we had designed it that way.
>>
>> So let's assume now that you run your device emulation fully in an I/O
>> thread, which we will assume for simplicity sits mostly in host
>> user-space, and your guest I/O code runs in a CPU thread, which we
>> will assume sits mostly in guest user/kernel space.
>>
>> It is possible to share two-way doorbells / IRQ queues on some memory
>> page, very similar to a virtqueue. When you want to "doorbell" your
>> device, you simply write to that page. The device thread picks it up
>> by reading the same page, and posts I/O completions on the same page,
>> with simple memory writes.
>>
>> Consider this I/O exchange buffer as having (at least) a writer and a
>> reader index for both doorbells and virtual interrupts. In the
>> explanation below, I will call them "dwi", "dri", "iwi" and "iri" for
>> the doorbell / interrupt write and read indices. (Note that as a key
>> optimization, you really don't want dwi and dri to be in the same
>> cache line, since different CPUs are going to read and write them.)
>>
>> You obviously still need to "kick" the I/O or CPU thread, and we are
>> talking about an IPI here since you don't know which CPU the other
>> thread is sitting on. But the interesting property is that you only
>> need to do that when dwi==dri or iwi==iri, because if not, the other
>> side has already been "kicked" and will keep working, i.e.
>> incrementing dri or iri, until it reaches that state again.
>>
>> The real "interrupt coalescing" trick can happen here. In some cases,
>> you can decide to update your dwi or iwi without kicking, as long as
>> you know that you will need to kick later. That requires some heavy
>> cooperation from guest drivers, though, and is a second-order
>> optimization.
>>
>> With a scheme like this, you replace a systematic context switch for
>> each device interrupt with a memory write and a "fire and forget" kick
>> IPI that only happens when the system is not already busy processing
>> I/Os, so that it can be eliminated when the system is most busy. With
>> interrupt coalescing, you can send IPIs at a rate much lower than the
>> actual I/O rate.
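
To make the above concrete, here is a rough sketch in C of such an I/O
exchange buffer, assuming C11 atomics, a single producer and a single
consumer per direction, and a hypothetical kick_ipi() standing in for
whatever IPI/wakeup primitive the hypervisor provides. The names and
layout are purely illustrative, not an existing QEMU/KVM interface:

#include <stdatomic.h>
#include <stdint.h>

#define CACHELINE 64

struct io_xchg {
    /* Doorbell ring: the guest (CPU thread) writes, the device (I/O
     * thread) reads. Keep producer and consumer indices on separate
     * cache lines so the two threads don't bounce the same line. */
    _Alignas(CACHELINE) _Atomic uint32_t dwi;  /* doorbell write index  */
    _Alignas(CACHELINE) _Atomic uint32_t dri;  /* doorbell read index   */
    /* Virtual interrupt ring: the device writes, the guest reads. */
    _Alignas(CACHELINE) _Atomic uint32_t iwi;  /* interrupt write index */
    _Alignas(CACHELINE) _Atomic uint32_t iri;  /* interrupt read index  */
};

/* Hypothetical "fire and forget" kick to wherever the peer thread runs. */
void kick_ipi(int peer);

/* Guest side: publish a doorbell, and kick only if the device had
 * already caught up (dwi == dri), i.e. it may have gone idle. */
static void post_doorbell(struct io_xchg *x, int device_peer)
{
    uint32_t wi = atomic_load_explicit(&x->dwi, memory_order_relaxed);
    uint32_t ri = atomic_load_explicit(&x->dri, memory_order_acquire);

    /* ... fill in doorbell slot wi % ring_size here ... */
    atomic_store_explicit(&x->dwi, wi + 1, memory_order_release);

    if (wi == ri)
        kick_ipi(device_peer);  /* peer had drained everything: wake it */
    /* otherwise the peer is still processing entries and will see ours */
}

/* Device side: drain doorbells until it catches up with the guest. */
static void drain_doorbells(struct io_xchg *x)
{
    uint32_t ri = atomic_load_explicit(&x->dri, memory_order_relaxed);

    while (ri != atomic_load_explicit(&x->dwi, memory_order_acquire)) {
        /* ... process doorbell entry ri % ring_size here ... */
        ri++;
        atomic_store_explicit(&x->dri, ri, memory_order_release);
    }
}

The same pattern, with iwi/iri, covers completions flowing back to the
guest. A real implementation also has to close the race where the
consumer decides to stop polling just as the producer publishes a new
index without kicking; the VIRTIO spec addresses that kind of race with
memory barriers and a re-check after notifications are re-enabled.
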
>>
>> Not sure how difficult it is to adapt a scheme like this to the
>> current state of qemu / kvm, but I'm pretty sure it works well if you
>> implement it correctly ;-)
>>
>>>
>>> A bigger issue than vmexit latency is device emulation thread wakeup
>>> latency. There is a thread (QEMU, vhost-user, vhost, etc.) monitoring
>>> the ioeventfd, but it may be descheduled. Its physical CPU may be in
>>> a low power state. I ran a benchmark late last year with QEMU's
>>> AioContext adaptive polling disabled so we can measure the wakeup
>>> latency:
>>>
>>>  CPU 0/KVM    26102 [000] 85626.737072: kvm:kvm_fast_mmio: fast mmio at gpa 0xfde03000
>>>  IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
>>>                  4 microseconds ------^
>
> Hi Christophe,
> QEMU/KVM does something similar to what you described. In the perf
> output above the vmexit kvm_fast_mmio event occurs on physical CPU
> "[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> resumes guest execution immediately after marking the ioeventfd ready.
> There is no context switch to the IOThread or a return from
> ioctl(KVM_RUN) on CPU#0.

Oh, that's good. But then the conclusion that the 4us delay limits us
to 250k IOPS is incorrect, no? Is there anything that would prevent
multiple I/O events (doorbell or interrupt) from being in flight at the
same time?

> The IOThread reads the eventfd. An eventfd is a counter that is reset
> to 0 on read. Because it's a counter you get coalescing: if the guest
> performs multiple MMIO writes, the ioeventfd counter increases but the
> IOThread only wakes up once and reads the ioeventfd.
>
> VIRTIO itself also has a mechanism for suppressing notifications,
> called EVENT_IDX. It allows the driver to let the device know that it
> does not require interrupts, and the device to let the driver know it
> does not require virtqueue kicks. This reminds me a bit of the
> mitigation mechanism you described.
>
> Stefan
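
As a side note on the eventfd coalescing Stefan describes, the counter
semantics are easy to demonstrate in isolation. The sketch below uses a
plain eventfd and write() as a stand-in for the signalling that KVM's
ioeventfd does on the guest's MMIO write; it illustrates the counter
behaviour only, not the QEMU code itself:

#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, EFD_CLOEXEC);   /* no EFD_SEMAPHORE */
    uint64_t one = 1, total = 0;

    /* Three "guest notifications" arrive before the IOThread wakes up. */
    write(efd, &one, sizeof(one));
    write(efd, &one, sizeof(one));
    write(efd, &one, sizeof(one));

    /* One wakeup, one read: the accumulated count comes back and the
     * counter resets to 0, so three kicks cost a single wakeup. */
    read(efd, &total, sizeof(total));
    printf("coalesced notifications: %llu\n", (unsigned long long)total);

    close(efd);
    return 0;
}

EVENT_IDX goes one step further: each side publishes an event index in
the ring, and the peer only sends a kick or interrupt when it crosses
that index, rather than once per buffer.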