> On 16 Jul 2020, at 16:19, Stefan Hajnoczi wrote:
>
> On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
>>
>>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi wrote:
>>>
>>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
>>>>
>>>> Stefan Hajnoczi writes:
>>>>
>>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>>>>>> Stefan Hajnoczi writes:
>>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>>>>>>>> Finally, I'm curious whether this is just a problem avoided by the
>>>>>>>> s390 channel approach. Does the use of messages over a channel
>>>>>>>> simply avoid the sort of bouncing back and forth that other
>>>>>>>> hypervisors have to do when emulating a device?
>>>>>>>
>>>>>>> What does "bouncing back and forth" mean exactly?
>>>>>>
>>>>>> Context switching between guest and hypervisor.
>>>>>
>>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
>>>>> request on s390 channel I/O.
>>>>
>>>> Thanks.
>>>>
>>>> I was also wondering about the efficiency of doorbells/notifications
>>>> the other way. AFAIUI, for both PCI and MMIO only a single write to
>>>> the notify flag is required, which causes a trap to the hypervisor
>>>> and the rest of the processing. The hypervisor doesn't pay the cost
>>>> of multiple exits to read the guest state, although it obviously
>>>> wants to be as efficient as possible in passing the data up to
>>>> whatever is handling the backend of the device, so it doesn't need
>>>> to do multiple context switches.
>>>>
>>>> Has there been any investigation into other mechanisms for notifying
>>>> the hypervisor of an event - for example using a HYP call or similar
>>>> mechanism?
>>>>
>>>> My gut tells me this probably doesn't make any difference, as a trap
>>>> to the hypervisor is likely to cost the same either way because you
>>>> still need to save the guest context before actioning something, but
>>>> it would be interesting to know if anyone has looked at it. Perhaps
>>>> there is a benefit in partitioned systems, where the core running
>>>> the guest can return straight away after initiating what it needs to
>>>> internally in the hypervisor to pass the notification to something
>>>> that can deal with it?
>>>
>>> It's very architecture-specific. This is something Michael Tsirkin
>>> looked into in the past. He found that MMIO and PIO perform
>>> differently on x86. VIRTIO supports both so the device can be
>>> configured optimally. There was an old discussion from 2013 here:
>>> https://lkml.org/lkml/2013/4/4/299
>>>
>>> Without nested page tables MMIO was slower than PIO. But with nested
>>> page tables it was faster.
>>>
>>> Another option on x86 is using Model-Specific Registers (for
>>> hypercalls), but this doesn't fit into the PCI device model.
>>
>> (Warning: what I write below is based on experience with very
>> different architectures, both CPU and hypervisor; your mileage may
>> vary.)
>>
>> It looks to me like the discussion so far is mostly focused on a
>> "synchronous" model where presumably the same CPU is switching context
>> between guest and (host) device emulation.
>>
>> However, I/O devices on real hardware are asynchronous by
>> construction. They do their thing while the CPU processes stuff. So at
>> least theoretically, there is no reason to context switch on the same
>> CPU. You could very well have an I/O thread on some other CPU doing
>> its thing. This makes it possible to do something some of you may have
>> heard me talk about, called "interrupt coalescing".
>>
>> As Stefan noted, this is not always a win, as it may introduce
>> latency. There are at least two cases where this latency really hurts:
>>
>> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
>> was not active recently. Everything from cache to TLB may hit you
>> here, but that normally happens when there isn't much I/O activity, so
>> in practice this case does not hurt that much, or rather it hurts in a
>> case where we don't really care.
>>
>> 2. When the I/O thread is preempted, or not given enough cycles to do
>> its stuff. This happens when the system is both CPU and I/O bound, and
>> addressing that is mostly a scheduling issue. A CPU thread could hand
>> off to a specific I/O thread, reducing that case to the kind of
>> context switch Alex was mentioning, but I'm not sure how feasible it
>> is to implement that on Linux / kvm.
>>
>> In such cases, you have to pay for a context switch. I'm not sure that
>> context switch is markedly more expensive than a "vmexit". On at least
>> that alien architecture I was familiar with, there was little
>> difference between switching to "your" host CPU thread and switching
>> to "another" host I/O thread. But then the context switch was all in
>> software, so we had designed it that way.
>>
>> So let's assume now that you run your device emulation fully in an I/O
>> thread, which we will assume for simplicity sits mostly in host
>> user-space, and your guest I/O code runs in a CPU thread, which we
>> will assume sits mostly in guest user/kernel space.
>>
>> It is possible to share two-way doorbells / IRQ queues on some memory
>> page, very similar to a virtqueue. When you want to "doorbell" your
>> device, you simply write to that page. The device thread picks it up
>> by reading the same page, and posts I/O completions on the same page,
>> with simple memory writes.
>>
>> Consider this I/O exchange buffer as having (at least) a writer and a
>> reader index for both doorbells and virtual interrupts. In the
>> explanation below, I will call them "dwi", "dri", "iwi" and "iri" for
>> the doorbell / interrupt write and read indices. (Note that as a key
>> optimization, you really don't want dwi and dri to be in the same
>> cache line, since different CPUs are going to read and write them.)
>>
>> You obviously still need to "kick" the I/O or CPU thread, and we are
>> talking about an IPI here since you don't know which CPU the other
>> thread is sitting on. But the interesting property is that you only
>> need to do that when dwi==dri or iwi==iri, because if not, the other
>> side has already been "kicked" and will keep working, i.e.
>> incrementing dri or iri, until it reaches that state again.
>>
>> The real "interrupt coalescing" trick can happen here. In some cases,
>> you can decide to update your dwi or iwi without kicking, as long as
>> you know that you will need to kick later. That requires some heavy
>> cooperation from guest drivers, though, and is a second-order
>> optimization.
>>
>> With a scheme like this, you replace a systematic context switch for
>> each device interrupt with a memory write and a "fire and forget" kick
>> IPI that only happens when the system is not already busy processing
>> I/Os, so that it can be eliminated when the system is most busy. With
>> interrupt coalescing, you can send IPIs at a rate much lower than the
>> actual I/O rate.
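
To make the above concrete, here is a rough sketch in C of such an I/O
exchange buffer, assuming C11 atomics, a single producer and a single
consumer per direction, and a hypothetical kick_ipi() standing in for
whatever IPI/wakeup primitive the hypervisor provides. The names and
layout are purely illustrative, not an existing QEMU/KVM interface:

#include <stdatomic.h>
#include <stdint.h>

#define CACHELINE 64

struct io_xchg {
    /* Doorbell ring: the guest (CPU thread) writes, the device (I/O
     * thread) reads. Keep producer and consumer indices on separate
     * cache lines so the two threads don't bounce the same line. */
    _Alignas(CACHELINE) _Atomic uint32_t dwi;  /* doorbell write index  */
    _Alignas(CACHELINE) _Atomic uint32_t dri;  /* doorbell read index   */
    /* Virtual interrupt ring: the device writes, the guest reads. */
    _Alignas(CACHELINE) _Atomic uint32_t iwi;  /* interrupt write index */
    _Alignas(CACHELINE) _Atomic uint32_t iri;  /* interrupt read index  */
};

/* Hypothetical "fire and forget" kick to wherever the peer thread runs. */
void kick_ipi(int peer);

/* Guest side: publish a doorbell, and kick only if the device had
 * already caught up (dwi == dri), i.e. it may have gone idle. */
static void post_doorbell(struct io_xchg *x, int device_peer)
{
    uint32_t wi = atomic_load_explicit(&x->dwi, memory_order_relaxed);
    uint32_t ri = atomic_load_explicit(&x->dri, memory_order_acquire);

    /* ... fill in doorbell slot wi % ring_size here ... */
    atomic_store_explicit(&x->dwi, wi + 1, memory_order_release);

    if (wi == ri)
        kick_ipi(device_peer);  /* peer had drained everything: wake it */
    /* otherwise the peer is still processing entries and will see ours */
}

/* Device side: drain doorbells until it catches up with the guest. */
static void drain_doorbells(struct io_xchg *x)
{
    uint32_t ri = atomic_load_explicit(&x->dri, memory_order_relaxed);

    while (ri != atomic_load_explicit(&x->dwi, memory_order_acquire)) {
        /* ... process doorbell entry ri % ring_size here ... */
        ri++;
        atomic_store_explicit(&x->dri, ri, memory_order_release);
    }
}

The same pattern, with iwi/iri, covers completions flowing back to the
guest. A real implementation also has to close the race where the
consumer decides to stop polling just as the producer publishes a new
index without kicking; the VIRTIO spec addresses that kind of race with
memory barriers and a re-check after notifications are re-enabled.
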
>>
>> Not sure how difficult it is to adapt a scheme like this to the
>> current state of qemu / kvm, but I'm pretty sure it works well if you
>> implement it correctly ;-)
>>
>>>
>>> A bigger issue than vmexit latency is device emulation thread wakeup
>>> latency. There is a thread (QEMU, vhost-user, vhost, etc.) monitoring
>>> the ioeventfd, but it may be descheduled. Its physical CPU may be in
>>> a low power state. I ran a benchmark late last year with QEMU's
>>> AioContext adaptive polling disabled so we can measure the wakeup
>>> latency:
>>>
>>>  CPU 0/KVM    26102 [000] 85626.737072: kvm:kvm_fast_mmio: fast mmio at gpa 0xfde03000
>>>  IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
>>>                  4 microseconds ------^
>
> Hi Christophe,
> QEMU/KVM does something similar to what you described. In the perf
> output above the vmexit kvm_fast_mmio event occurs on physical CPU
> "[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> resumes guest execution immediately after marking the ioeventfd ready.
> There is no context switch to the IOThread or a return from
> ioctl(KVM_RUN) on CPU#0.

Oh, that's good. But then the conclusion that the 4us delay limits us
to 250k IOPS is incorrect, no? Is there anything that would prevent
multiple I/O events (doorbell or interrupt) from being in flight at the
same time?

> The IOThread reads the eventfd. An eventfd is a counter that is reset
> to 0 on read. Because it's a counter you get coalescing: if the guest
> performs multiple MMIO writes, the ioeventfd counter increases but the
> IOThread only wakes up once and reads the ioeventfd.
>
> VIRTIO itself also has a mechanism for suppressing notifications,
> called EVENT_IDX. It allows the driver to let the device know that it
> does not require interrupts, and the device to let the driver know it
> does not require virtqueue kicks. This reminds me a bit of the
> mitigation mechanism you described.
>
> Stefan
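
As a side note on the eventfd coalescing Stefan describes, the counter
semantics are easy to demonstrate in isolation. The sketch below uses a
plain eventfd and write() as a stand-in for the signalling that KVM's
ioeventfd does on the guest's MMIO write; it illustrates the counter
behaviour only, not the QEMU code itself:

#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, EFD_CLOEXEC);   /* no EFD_SEMAPHORE */
    uint64_t one = 1, total = 0;

    /* Three "guest notifications" arrive before the IOThread wakes up. */
    write(efd, &one, sizeof(one));
    write(efd, &one, sizeof(one));
    write(efd, &one, sizeof(one));

    /* One wakeup, one read: the accumulated count comes back and the
     * counter resets to 0, so three kicks cost a single wakeup. */
    read(efd, &total, sizeof(total));
    printf("coalesced notifications: %llu\n", (unsigned long long)total);

    close(efd);
    return 0;
}

EVENT_IDX goes one step further: each side publishes an event index in
the ring, and the peer only sends a kick or interrupt when it crosses
that index, rather than once per buffer.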