From: Stefan Hajnoczi <stefanha@redhat.com>
To: Christophe de Dinechin <dinechin@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Subject: Re: [virtio-dev] On doorbells (queue notifications)
Date: Fri, 17 Jul 2020 09:42:18 +0100
Message-ID: <20200717084218.GA128195@stefanha-x1.localdomain>
In-Reply-To: <2D260DC5-B2FF-4760-B6A4-BCCD04BD1066@redhat.com>

On Thu, Jul 16, 2020 at 04:34:04PM +0200, Christophe de Dinechin wrote:
> 
> 
> > On 16 Jul 2020, at 16:19, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> 
> >>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> >>>> 
> >>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>>> 
> >>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >>>>>>>> Finally I'm curious if this is just a problem avoided by the s390
> >>>>>>>> channel approach? Does the use of messages over a channel just avoid the
> >>>>>>>> sort of bouncing back and forth that other hypervisors have to do when
> >>>>>>>> emulating a device?
> >>>>>>> 
> >>>>>>> What does "bouncing back and forth" mean exactly?
> >>>>>> 
> >>>>>> Context switching between guest and hypervisor.
> >>>>> 
> >>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> >>>>> request on s390 channel I/O.
> >>>> 
> >>>> Thanks.
> >>>> 
> >>>> I was also wondering about the efficiency of doorbells/notifications the
> >>>> other way. AFAIUI for both PCI and MMIO only a single write is required
> >>>> to the notify flag, which causes a trap to the hypervisor and the rest
> >>>> of the processing. The hypervisor doesn't incur the cost of multiple
> >>>> exits to read the guest state, although it obviously wants to be as
> >>>> efficient as possible in passing the data back up to whatever is
> >>>> handling the backend of the device, so it doesn't need to do multiple
> >>>> context switches.
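
(To make that single write concrete: with the virtio-mmio transport the
guest-side kick is one store to the QueueNotify register, roughly as in the
untested sketch below; the base pointer and queue index are illustrative,
not taken from a real driver.)

#include <stdint.h>

/* Guest-side doorbell for a virtio-mmio device: a single MMIO store to
 * the QueueNotify register (offset 0x50 in the virtio-mmio layout)
 * traps to the hypervisor exactly once per notification. */
static inline void virtio_mmio_kick(volatile uint8_t *mmio_base,
                                    uint16_t queue_index)
{
    *(volatile uint32_t *)(mmio_base + 0x50) = queue_index;
}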
> >>>> 
> >>>> Has there been any investigation into other mechanisms for notifying the
> >>>> hypervisor of an event - for example using a HYP call or similar
> >>>> mechanism?
> >>>> 
> >>>> My gut tells me this probably doesn't make any difference, as a trap to
> >>>> the hypervisor is likely to cost the same either way because you still
> >>>> need to save the guest context before actioning something, but it would
> >>>> be interesting to know if anyone has looked at it. Perhaps there is a
> >>>> benefit in partitioned systems where the core running the guest can
> >>>> return straight away after initiating what it needs internally in the
> >>>> hypervisor to pass the notification to something that can deal with it?
> >>> 
> >>> It's very architecture-specific. This is something Michael Tsirkin
> >>> looked into in the past. He found that MMIO and PIO perform differently on
> >>> x86. VIRTIO supports both so the device can be configured optimally.
> >>> There was an old discussion from 2013 here:
> >>> https://lkml.org/lkml/2013/4/4/299
> >>> 
> >>> Without nested page tables MMIO was slower than PIO. But with nested
> >>> page tables it was faster.
> >>> 
> >>> Another option on x86 is using Model-Specific Registers (for hypercalls)
> >>> but this doesn't fit into the PCI device model.
> >> 
> >> (Warning: What I write below is based on experience with very different
> >> architectures, both CPU and hypervisor; your mileage may vary)
> >> 
> >> It looks to me like the discussion so far is mostly focused on a "synchronous"
> >> model where presumably the same CPU is switching context between
> >> guest and (host) device emulation.
> >> 
> >> However, I/O devices on real hardware are asynchronous by construction.
> >> They do their thing while the CPU processes stuff. So at least theoretically,
> >> there is no reason to context switch on the same CPU. You could very well
> >> have an I/O thread on some other CPU doing its thing. This makes it
> >> possible to do something some of you may have heard me talk about,
> >> called "interrupt coalescing".
> >> 
> >> As Stefan noted, this is not always a win, as it may introduce latency.
> >> There are at least two cases where this latency really hurts:
> >> 
> >> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
> >> was not active recently. Everything from cache to TLB may hit you here,
> >> but that normally happens when there isn't much I/O activity, so this case
> >> in practice does not hurt that much, or rather it hurts in a case
> >> where we don't really care.
> >> 
> >> 2. When the I/O thread is preempted, or not given enough cycles to do its
> >> stuff. This happens when the system is both CPU and I/O bound, and
> >> addressing that is mostly a scheduling issue. A CPU thread could hand off
> >> to a specific I/O thread, reducing that case to the kind of context switch
> >> Alex was mentioning, but I'm not sure how feasible it is to implement
> >> that on Linux / kvm.
> >> 
> >> In such cases, you have to pay for a context switch. I'm not sure if that
> >> context switch is markedly more expensive than a "vmexit". On at least
> >> that alien architecture I was familiar with, there was little difference between
> >> switching to "your" host CPU thread and switching to "another" host
> >> I/O thread. But then the context switch was all in software, so we had
> >> designed it that way.
> >> 
> >> So let's assume now that you run your device emulation fully in an I/O
> >> thread, which we will assume for simplicity sits mostly in host user-space,
> >> and your guest I/O code runs in a CPU thread, which we will assume
> >> sits mostly in guest user/kernel space.
> >> 
> >> It is possible to share two-way doorbells / IRQ queues on some memory
> >> page, very similar to a virtqueue. When you want to "doorbell" your device,
> >> you simply write to that page. The device thread picks it up by reading
> >> the same page, and posts I/O completions on the same page, with simple
> >> memory writes.
> >> 
> >> Consider this I/O exchange buffer as having (at least) a writer and reader
> >> index for both doorbells and virtual interrupts. In the explanation
> >> below, I will call them "dwi", "dri", "iwi" and "iri" for the doorbell
> >> and interrupt write and read indices. (Note that as a key optimization,
> >> you really don't want dwi and dri to be in the same cache line, since
> >> different CPUs are going to read and write them.)
> >> 
> >> You obviously still need to "kick" the I/O or CPU thread, and we are
> >> talking about an IPI here since you don't know which CPU that other
> >> thread is sitting on. But the interesting property is that you only need
> >> to do that when dwi==dri or iwi==iri, because if not, the other side
> >> has already been "kicked" and will keep working, i.e. incrementing
> >> dri or iri, until it gets back to that state.
> >> 
> >> The real "interrupt coalescing" trick can happen here. In some
> >> cases, you can decide to update your dwi or iwi without kicking,
> >> as long as you know that you will need to kick later. That requires
> >> some heavy cooperation from guest drivers, though, and is a
> >> second-order optimization.
> >> 
> >> With a scheme like this, you replace a systematic context switch
> >> for each device interrupt with a memory write and a "fire and forget"
> >> kick IPI that only happens when the system is not already busy
> >> processing I/Os, so that it can be eliminated when the system is
> >> most busy. With interrupt coalescing, you can send IPIs at a rate
> >> much lower than the actual I/O rate.
> >> 
> >> Not sure how difficult it is to adapt a scheme like this to the current
> >> state of qemu / kvm, but I'm pretty sure it works well if you implement
> >> it correctly ;-)
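
A minimal, untested sketch of that kick rule, with made-up names (each
index gets its own cache line, per the note above, and the producer only
sends an IPI when the consumer had already caught up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared I/O exchange page, illustrative only. */
struct io_exchange {
    _Alignas(64) _Atomic uint32_t dwi;  /* doorbell write index (CPU thread)  */
    _Alignas(64) _Atomic uint32_t dri;  /* doorbell read index (I/O thread)   */
    _Alignas(64) _Atomic uint32_t iwi;  /* interrupt write index (I/O thread) */
    _Alignas(64) _Atomic uint32_t iri;  /* interrupt read index (CPU thread)  */
};

/* CPU-thread side: publish one doorbell entry.  Returns true when an IPI
 * is needed because the I/O thread had drained everything (dri == old dwi)
 * and may be idle; otherwise it is still running and will pick the new
 * index up on its next pass. */
static bool post_doorbell(struct io_exchange *x)
{
    uint32_t old = atomic_load_explicit(&x->dwi, memory_order_relaxed);

    atomic_store_explicit(&x->dwi, old + 1, memory_order_release);
    return atomic_load_explicit(&x->dri, memory_order_acquire) == old;
}

The interrupt direction (iwi/iri) is symmetric, and virtio's EVENT_IDX
feature provides a comparable suppression rule inside the ring itself.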
> >> 
> >>> 
> >>> A bigger issue than vmexit latency is device emulation thread wakeup
> >>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
> >>> ioeventfd but it may be descheduled. Its physical CPU may be in a low
> >>> power state. I ran a benchmark late last year with QEMU's AioContext
> >>> adaptive polling disabled so we can measure the wakeup latency:
> >>> 
> >>>      CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio: fast mmio at gpa 0xfde03000
> >>>   IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
> >>>                  4 microseconds ------^
> > 
> > Hi Christophe,
> > QEMU/KVM does something similar to what you described. In the perf
> > output above the vmexit kvm_fast_mmio event occurs on physical CPU
> > "[000]".  The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> > resumes guest execution immediately after marking the ioeventfd ready.
> > There is no context switch to the IOThread or a return from
> > ioctl(KVM_RUN) on CPU#0.
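
(For reference, that decoupling is what KVM's ioeventfd interface provides:
the doorbell GPA is bound to an eventfd, the write is completed in the
kernel, and the IOThread is woken asynchronously. An untested sketch, with
vm_fd and the GPA as placeholders:)

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

/* Bind a guest doorbell address to an eventfd.  After this, a guest write
 * to the address is handled in the kernel without returning to userspace,
 * and the returned fd becomes readable for the IOThread. */
static int register_doorbell(int vm_fd, uint64_t doorbell_gpa)
{
    struct kvm_ioeventfd ioev;
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

    if (efd < 0)
        return -1;

    memset(&ioev, 0, sizeof(ioev));
    ioev.addr = doorbell_gpa;  /* e.g. 0xfde03000 from the trace above */
    ioev.len  = 0;             /* zero length: match writes of any size */
    ioev.fd   = efd;

    if (ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0)
        return -1;

    return efd;                /* the IOThread ppoll()s this fd */
}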
> > 
> Oh, that's good.
> 
> But then the conclusion that the 4us delay limits us to 250k IOPS
> is incorrect, no? Is there anything that would prevent multiple
> I/O events (doorbell or interrupt) from being in flight at the same time?

The number I posted reflects a worst-case latency scenario:

1. Queue depth = 1. No batching. Workloads that want to achieve maximum
   IOPS typically queue more than 1 request at a time. When multiple
   requests are queued, both submission and completion can be batched so
   that only a fraction of vmexits + interrupts are required to process
   N requests.

2. QEMU AioContext adaptive polling disabled. Polling skips the ioeventfd
   and instead polls the virtqueue ring index memory that the guest
   updates. Polling is on by default when -device
   virtio-blk-pci,iothread= is used but I disabled it for this test.
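
For what it's worth, the polling in point 2 boils down to a bounded
busy-wait on the guest-written ring index before falling back to a blocking
wait on the ioeventfd. A rough, untested sketch (names are illustrative,
not QEMU's actual code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Spin on the avail index the guest updates for up to poll_ns.  Returns
 * true if new requests showed up while polling (no wakeup latency paid);
 * on false the caller re-arms the ioeventfd and blocks as usual. */
static bool poll_virtqueue(const _Atomic uint16_t *avail_idx,
                           uint16_t last_seen, uint64_t poll_ns)
{
    uint64_t deadline = now_ns() + poll_ns;

    while (now_ns() < deadline) {
        if (atomic_load_explicit(avail_idx, memory_order_acquire) != last_seen)
            return true;
    }
    return false;
}

When the poll window hits, the 4 microsecond wakeup above disappears; when
it misses, you burn a little CPU for nothing, which is why the window is
adaptive.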

But if we stick with the worst-case scenario then we are really limited
to 250k IOPS per virtio-blk device because a single IOThread is
processing the virtqueue with a 4 microsecond wakeup latency.
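
(Sanity check on the number: at queue depth 1 every request pays that
wakeup, so the ceiling is roughly 1 s / 4 us = 250,000 requests per second,
before any time spent actually processing the request.)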

Stefan
