It looks to me like the discussion so far is mostly focused on a "synchronous"
model where presumably the same CPU is switching context between
guest and (host) device emulation.
However, I/O devices on real hardware are asynchronous by construction.
They do their thing while the CPU processes stuff. So at least theoretically,
there is no reason to context switch on the same CPU. You could very well
have an I/O thread on some other CPU doing its thing. This makes it
possible to do something some of you may have heard me talk about,
called "interrupt coalescing".
As Stefan noted, this is not always a win, as it may introduce latency.
There are at least two cases where this latency really hurts:
1. When the I/O thread is in some kind of deep sleep, e.g. because it
was not active recently. Everything from cold caches to TLB misses may
hit you here, but that normally happens when there isn't much I/O
activity, so in practice this case does not hurt that much, or rather
it hurts in a case where we don't really care.
2. When the I/O thread is preempted, or not given enough cycles to do its
stuff. This happens when the system is both CPU and I/O bound, and
addressing that is mostly a scheduling issue. A CPU thread could hand off
to a specific I/O thread, reducing that case to the kind of context switch
Alex was mentioning, but I'm not sure how feasible it is to implement
that on Linux / kvm.
In such cases, you have to pay for a context switch. I'm not sure if that
context switch is markedly more expensive than a "vmexit". On at least
that alien architecture I was familiar with, there was little difference between
switching to "your" host CPU thread and switching to "another" host
I/O thread. But then the context switch was all in software, so we had
designed it that way.
So let's assume now that you run your device emulation fully in an I/O
thread, which we will assume for simplicity sits mostly in host user-space,
and your guest I/O code runs in a CPU thread, which we will assume
sits mostly in guest user/kernel space.
It is possible to share two-way doorbells / IRQ queues on some memory
page, very similar to a virtqueue. When you want to "doorbell" your device,
you simply write to that page. The device thread picks it up by reading
the same page, and posts I/O completions on the same page, with simple
memory writes.
Consider this I/O exchange buffer as having (at least) a writer and a
reader index for both doorbells and virtual interrupts. In the
explanation below, I will call them "dwi", "dri", "iwi" and "iri" for
the doorbell / interrupt write and read indices. (Note that as a key
optimization, you really don't want dwi and dri to be in the same
cache line, since different CPUs are going to read and write them.)
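To make the layout concrete, here is a minimal sketch in C of what such
a page could look like. The names, types and ring sizes are made up for
illustration, not taken from any existing implementation:

#include <stdint.h>

#define CACHE_LINE 64
#define RING_SIZE  64  /* arbitrary power of two for this sketch */

struct io_exchange {
    /* Each index lives in its own cache line because it is written
     * by one side and polled by the other. */
    uint32_t dwi __attribute__((aligned(CACHE_LINE))); /* doorbell write, CPU thread  */
    uint32_t dri __attribute__((aligned(CACHE_LINE))); /* doorbell read, I/O thread   */
    uint32_t iwi __attribute__((aligned(CACHE_LINE))); /* interrupt write, I/O thread */
    uint32_t iri __attribute__((aligned(CACHE_LINE))); /* interrupt read, CPU thread  */

    uint64_t doorbells[RING_SIZE];  /* guest -> device requests    */
    uint64_t interrupts[RING_SIZE]; /* device -> guest completions */
};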
You obviously still need to "kick" the I/O or CPU thread, and we are
talking about an IPI here since you don't know which CPU that other
thread is sitting on. But the interesting property is that you only need
to do that when dwi==dri or iwi==iri, because otherwise the other side
has already been "kicked" and will keep working, i.e. incrementing
dri or iri, until it reaches that state again.
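In code, the producer side of that rule might look roughly like this (a
sketch building on the struct above; kick_io_thread() stands for
whatever IPI or eventfd mechanism is available, and a real
implementation would need to be far more careful about the sleep/wake
race and memory ordering):

extern void kick_io_thread(void); /* IPI / eventfd signal, provided elsewhere */

/* CPU-thread side: post a doorbell; kick only if the I/O thread
 * had drained the ring (dri == dwi) and may therefore be asleep. */
void post_doorbell(struct io_exchange *xc, uint64_t entry)
{
    uint32_t wi = __atomic_load_n(&xc->dwi, __ATOMIC_RELAXED);

    xc->doorbells[wi % RING_SIZE] = entry;
    /* Publish the entry before the new index becomes visible. */
    __atomic_store_n(&xc->dwi, wi + 1, __ATOMIC_RELEASE);
    /* Order the index update before the dri check below. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    if (__atomic_load_n(&xc->dri, __ATOMIC_RELAXED) == wi)
        kick_io_thread(); /* the reader had caught up; wake it */
}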
The real "interrupt coalescing" trick can happen here. In some
cases, you can decide to update your dwi or iwi without kicking,
as long as you know that you will need to kick later. That requires
some heavy cooperation from guest drivers, though, and is a
second-order optimization.
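For what it's worth, the device side of such a batching scheme might
look like the sketch below (again hypothetical; a production version
would mirror something like virtio's event index, where the guest
publishes the point at which it actually wants an interrupt):

extern void kick_cpu_thread(void); /* virtual interrupt injection, provided elsewhere */

/* I/O-thread side: post a batch of completions with at most one kick. */
void post_completions(struct io_exchange *xc, const uint64_t *done, int n)
{
    uint32_t wi = __atomic_load_n(&xc->iwi, __ATOMIC_RELAXED);
    int i;

    for (i = 0; i < n; i++)
        xc->interrupts[(wi + i) % RING_SIZE] = done[i];
    __atomic_store_n(&xc->iwi, wi + n, __ATOMIC_RELEASE);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    /* One virtual interrupt covers the whole batch, and only if the
     * guest had already consumed everything posted before it. */
    if (__atomic_load_n(&xc->iri, __ATOMIC_RELAXED) == wi)
        kick_cpu_thread();
}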
With a scheme like this, you replace a systematic context switch
for each device interrupt with a memory write and a "fire and forget"
kick IPI that only happens when the system is not already busy
processing I/Os, so that it can be eliminated when the system is
most busy. With interrupt coalescing, you can send IPIs at a rate
much lower than the actual I/O rate.
Not sure how difficult it is to adapt a scheme like this to the current
state of qemu / kvm, but I'm pretty sure it works well if you implement
it correctly ;-)
A bigger issue than vmexit latency is device emulation thread wakeup
latency. There is a thread (QEMU, vhost-user, vhost, etc.) monitoring the
ioeventfd, but it may be descheduled. Its physical CPU may be in a low
power state. I ran a benchmark late last year with QEMU's AioContext
adaptive polling disabled so we could measure the wakeup latency:
CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio: fast mmio at gpa 0xfde03000
IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
               4 microseconds ------^
(I did not manually configure physical CPU power states or use the
idle=poll host kernel parameter.)
Each virtqueue kick had 4 microseconds of latency before the device
emulation thread had a chance to process the virtqueue. This means the
maximum I/O Operations Per Second (IOPS) is capped at 250k before
virtqueue processing has even begun!
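(That cap is just arithmetic: 1 s / 4 us = 250,000 wakeups per second,
assuming one notification per request.)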
This data is what prompted me to write the above. This 4us seems
really long to me.
I recall a benchmark where the technique above was reaching at least
400k IOPS for a single VM on a medium-size system (4 CPUs (*)).
I remember the time I ran this benchmark quite well, because it was just
after VMware made a big splash about reaching 100k IOPS:
(*) Yes, at the time, 4 CPUs was a medium-size system. Don't laugh.
QEMU AioContext adaptive polling helps here because we skip the vmexit
entirely while the IOThread is polling the vring (for up to 32
microseconds by default).
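For anyone who wants to experiment, the polling window is tunable per
IOThread on the QEMU command line; poll-max-ns is in nanoseconds, and
the value below merely restates the default (iothread1 and drive0 are
placeholder names):

  -object iothread,id=iothread1,poll-max-ns=32768
  -device virtio-blk-pci,drive=drive0,iothread=iothread1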
It would be great if more people dug into this and optimized
notifications further.
Stefan