On Thu, Jul 16, 2020 at 04:34:04PM +0200, Christophe de Dinechin wrote:
> 
> 
> > On 16 Jul 2020, at 16:19, Stefan Hajnoczi wrote:
> > 
> > On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi wrote:
> >>> 
> >>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> >>>> 
> >>>> Stefan Hajnoczi writes:
> >>>> 
> >>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >>>>>> Stefan Hajnoczi writes:
> >>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >>>>>>>> Finally I'm curious if this is just a problem avoided by the s390
> >>>>>>>> channel approach? Does the use of messages over a channel just avoid
> >>>>>>>> the sort of bouncing back and forth that other hypervisors have to do
> >>>>>>>> when emulating a device?
> >>>>>>> 
> >>>>>>> What does "bouncing back and forth" mean exactly?
> >>>>>> 
> >>>>>> Context switching between guest and hypervisor.
> >>>>> 
> >>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> >>>>> request on s390 channel I/O.
> >>>> 
> >>>> Thanks.
> >>>> 
> >>>> I was also wondering about the efficiency of doorbells/notifications the
> >>>> other way. AFAIUI for both PCI and MMIO only a single write to the
> >>>> notify flag is required, which causes a trap to the hypervisor and the
> >>>> rest of the processing. The hypervisor doesn't pay the cost of multiple
> >>>> exits to read the guest state, although it obviously wants to be as
> >>>> efficient as possible in passing the data back up to whatever is
> >>>> handling the backend of the device, so it doesn't need to do multiple
> >>>> context switches.
> >>>> 
> >>>> Has there been any investigation into other mechanisms for notifying the
> >>>> hypervisor of an event - for example using a HYP call or similar
> >>>> mechanism?
> >>>> 
> >>>> My gut tells me this probably doesn't make any difference, as a trap
> >>>> to the hypervisor is likely to cost the same either way because you
> >>>> still need to save the guest context before actioning something, but
> >>>> it would be interesting to know if anyone has looked at it. Perhaps
> >>>> there is a benefit in partitioned systems where the core running the
> >>>> guest can return straight away after initiating what it needs to
> >>>> internally in the hypervisor to pass the notification to something
> >>>> that can deal with it?
> >>> 
> >>> It's very architecture-specific. This is something Michael Tsirkin
> >>> looked into in the past. He found that MMIO and PIO perform differently
> >>> on x86. VIRTIO supports both so the device can be configured optimally.
> >>> There was an old discussion from 2013 here:
> >>> https://lkml.org/lkml/2013/4/4/299
> >>> 
> >>> Without nested page tables MMIO was slower than PIO. But with nested
> >>> page tables it was faster.
> >>> 
> >>> Another option on x86 is using Model-Specific Registers (for hypercalls)
> >>> but this doesn't fit into the PCI device model.
> >> 
> >> (Warning: what I write below is based on experience with very different
> >> architectures, both CPU and hypervisor; your mileage may vary.)
> >> 
> >> It looks to me like the discussion so far is mostly focused on a
> >> "synchronous" model where presumably the same CPU is switching context
> >> between guest and (host) device emulation.
> >> 
> >> However, I/O devices on real hardware are asynchronous by construction.
> >> They do their thing while the CPU processes stuff. So at least
> >> theoretically, there is no reason to context switch on the same CPU.
> >> You could very well have an I/O thread on some other CPU doing its
> >> thing. This allows one to do something some of you may have heard me
> >> talk about, called "interrupt coalescing".
> >> 
> >> As Stefan noted, this is not always a win, as it may introduce latency.
> >> There are at least two cases where this latency really hurts:
> >> 
> >> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
> >> was not active recently. Everything from cache to TLB may hit you here,
> >> but that normally happens when there isn't much I/O activity, so this
> >> case in practice does not hurt that much, or rather it hurts in a case
> >> where we don't really care.
> >> 
> >> 2. When the I/O thread is preempted, or not given enough cycles to do
> >> its stuff. This happens when the system is both CPU and I/O bound, and
> >> addressing that is mostly a scheduling issue. A CPU thread could hand
> >> off to a specific I/O thread, reducing that case to the kind of context
> >> switch Alex was mentioning, but I'm not sure how feasible it is to
> >> implement that on Linux / kvm.
> >> 
> >> In such cases, you have to pay for the context switch. I'm not sure if
> >> that context switch is markedly more expensive than a "vmexit". On at
> >> least that alien architecture I was familiar with, there was little
> >> difference between switching to "your" host CPU thread and switching
> >> to "another" host I/O thread. But then the context switch was all in
> >> software, so we had designed it that way.
> >> 
> >> So let's assume now that you run your device emulation fully in an I/O
> >> thread, which we will assume for simplicity sits mostly in host user
> >> space, and your guest I/O code runs in a CPU thread, which we will
> >> assume sits mostly in guest user/kernel space.
> >> 
> >> It is possible to share two-way doorbells / IRQ queues on some memory
> >> page, very similar to a virtqueue. When you want to "doorbell" your
> >> device, you simply write to that page. The device thread picks it up
> >> by reading the same page, and posts I/O completions on the same page,
> >> with simple memory writes.
> >> 
> >> Consider this I/O exchange buffer as having (at least) a writer and
> >> reader index for both doorbells and virtual interrupts.
> >> In the explanation below, I will call them "dwi", "dri", "iwi", "iri"
> >> for doorbell / interrupt read and write index. (Note that as a key
> >> optimization, you really don't want dwi and dri to be in the same
> >> cache line, since different CPUs are going to read and write them.)
> >> 
> >> You obviously still need to "kick" the I/O or CPU thread, and we are
> >> talking about an IPI here since you don't know which CPU that other
> >> thread is sitting on. But the interesting property is that you only
> >> need to do that when dwi==dri or iwi==iri, because if not, the other
> >> side has already been "kicked" and will keep working, i.e.
> >> incrementing dri or iri, until it reaches that state again.
> >> 
> >> The real "interrupt coalescing" trick can happen here. In some
> >> cases, you can decide to update your dwi or iwi without kicking,
> >> as long as you know that you will need to kick later. That requires
> >> some heavy cooperation from guest drivers, though, and is a
> >> second-order optimization.
> >> 
> >> With a scheme like this, you replace a systematic context switch
> >> for each device interrupt with a memory write and a "fire and forget"
> >> kick IPI that only happens when the system is not already busy
> >> processing I/Os, so that it can be eliminated when the system is
> >> most busy. With interrupt coalescing, you can send IPIs at a rate
> >> much lower than the actual I/O rate.
> >> 
> >> Not sure how difficult it is to adapt a scheme like this to the current
> >> state of qemu / kvm, but I'm pretty sure it works well if you implement
> >> it correctly ;-)
> >> 
> >>> 
> >>> A bigger issue than vmexit latency is device emulation thread wakeup
> >>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring
> >>> the ioeventfd but it may be descheduled. Its physical CPU may be in a
> >>> low power state.
> >>> I ran a benchmark late last year with QEMU's AioContext
> >>> adaptive polling disabled so we can measure the wakeup latency:
> >>> 
> >>>   CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio:
> >>>       fast mmio at gpa 0xfde03000
> >>>   IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
> >>>                    4 microseconds ------^
> 
> > Hi Christophe,
> > QEMU/KVM does something similar to what you described. In the perf
> > output above the vmexit kvm_fast_mmio event occurs on physical CPU
> > "[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> > resumes guest execution immediately after marking the ioeventfd ready.
> > There is no context switch to the IOThread or a return from
> > ioctl(KVM_RUN) on CPU#0.
> 
> Oh, that's good.
> 
> But then the conclusion that the 4us delay limits us to 250k IOPS
> is incorrect, no? Is there anything that would prevent multiple
> I/O events (doorbell or interrupt) from being in flight at the same time?

The number I posted is a worst-case latency scenario:

1. Queue depth = 1. No batching. Workloads that want to achieve maximum
   IOPS typically queue more than 1 request at a time. When multiple
   requests are queued, both submission and completion can be batched so
   that only a fraction of the vmexits + interrupts are required to
   process N requests.

2. QEMU AioContext adaptive polling disabled. Polling skips ioeventfd
   and instead polls on the virtqueue ring index memory that the guest
   updates. Polling is on by default when -device virtio-blk-pci,iothread=
   is used, but I disabled it for this test.

But if we stick with the worst-case scenario then we really are limited
to 250k IOPS per virtio-blk device, because a single IOThread is
processing the virtqueue with a 4 microsecond wakeup latency.

Stefan