* [virtio-dev] On doorbells (queue notifications)
@ 2020-07-14 21:43 Alex Bennée
  2020-07-15 11:48 ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Bennée @ 2020-07-14 21:43 UTC (permalink / raw)
To: virtio-dev; +Cc: Zha Bin, Jing Liu, Chao Peng

Hi,

At some point in the life cycle of a virt queue there needs to be a
notification event to signal to the guest or host that there has been an
update to the virtio structures. The number of context switches made by
the hypervisor is therefore a limiting factor in the overall performance
of the system.

As I understand it, on the host side this can either be reported up via
eventfd, or user-space can simply busy-wait loop reading the virt queue
status so it can process the queue as soon as a status change is
visible. On the guest side, however, a busy-wait approach is undesirable
as it prevents the hypervisor from properly sharing resources. This will
also be the case if you want to service the backend of a particular
virtio device in another, separate VM.

For virtio-pci you get an IRQ number from a (hopefully virtualised) GIC
and eventually end up at the IRQ handler, which does an
ioread8(vp_dev->isr). This implicitly clears the current IRQ event. The
value of the ISR tells the driver what event has occurred (config or
queue update).

I'm slightly confused by the MSI terminology because it only seems to be
relevant for the PCI legacy interface and AFAICT only touches the
outgoing path in setup and del_vq. Do incoming MSI interrupts just get
mapped directly to the appropriate handler function to process the
queues or the config?

For MMIO-based transports there is a more traditional setup of an
InterruptStatus register which is read and then acknowledged with a
write to an InterruptACK register. If the device memory is handled with
trap-and-emulate this means two round trips to the hypervisor just to
process one vq update.
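To make the cost concrete, here is a toy trap-and-emulate model (Python
pseudocode, purely illustrative - the class and handler are made up; only
the 0x60/0x64 offsets come from the virtio-mmio register layout) counting
the guest exits needed to service one interrupt:

```python
# Toy model (not QEMU/KVM code): every access to trapped device memory
# costs one guest->hypervisor round trip. The virtio-mmio interrupt
# path described above needs two such accesses per vq update.

INTERRUPT_STATUS = 0x60  # virtio-mmio register offsets
INTERRUPT_ACK = 0x64
VRING_IRQ = 0x1          # "used buffer" notification bit

class TrapAndEmulateDevice:
    def __init__(self):
        self.exits = 0           # guest->hypervisor round trips so far
        self.pending = VRING_IRQ # a vq update is waiting

    def read(self, offset):
        self.exits += 1          # an MMIO read traps to the hypervisor
        if offset == INTERRUPT_STATUS:
            return self.pending
        return 0

    def write(self, offset, value):
        self.exits += 1          # an MMIO write traps as well
        if offset == INTERRUPT_ACK:
            self.pending &= ~value

def guest_irq_handler(dev):
    status = dev.read(INTERRUPT_STATUS)  # trip 1: what happened?
    dev.write(INTERRUPT_ACK, status)     # trip 2: acknowledge it
    return status

dev = TrapAndEmulateDevice()
status = guest_irq_handler(dev)
print(dev.exits)  # 2 round trips for a single vq update
```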
I see there was a proposal to introduce the concept of MSI-based IRQs to
virtio-mmio last year:

  https://lore.kernel.org/patchwork/patch/1171065/

with proposed kernel changes in January:

  https://patchwork.kernel.org/cover/11343097/

I haven't seen any follow-up since those series were posted. Is this a
proposal that has support? Is there another version likely to be
proposed to the virtio-comment list?

Finally, I'm curious if this is just a problem avoided by the s390
channel approach? Does the use of messages over a channel just avoid the
sort of bouncing back and forth that other hypervisors have to do when
emulating a device?

--
Alex Bennée

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-14 21:43 [virtio-dev] On doorbells (queue notifications) Alex Bennée
@ 2020-07-15 11:48 ` Stefan Hajnoczi
  2020-07-15 13:29 ` Alex Bennée
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 11:48 UTC (permalink / raw)
To: Alex Bennée; +Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng

On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> I'm slightly confused by the MSI terminology because it only seems to
> be relevant for the PCI legacy interface and AFAICT only touches the
> outgoing path in setup and del_vq. Do incoming MSI interrupts just get
> mapped directly to the appropriate handler function to process the
> queues or the config?

When MSI is used the VIRTIO ISR register does not need to be read by the
guest interrupt handler.

> Finally I'm curious if this is just a problem avoided by the s390
> channel approach? Does the use of messages over a channel just avoid the
> sort of bouncing back and forth that other hypervisors have to do when
> emulating a device?

What does "bouncing back and forth" mean exactly?

Stefan
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 11:48 ` Stefan Hajnoczi
@ 2020-07-15 13:29 ` Alex Bennée
  2020-07-15 15:47 ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Bennée @ 2020-07-15 13:29 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>> I'm slightly confused by the MSI terminology because it only seems to
>> be relevant for the PCI legacy interface and AFAICT only touches the
>> outgoing path in setup and del_vq. Do incoming MSI interrupts just get
>> mapped directly to the appropriate handler function to process the
>> queues or the config?
>
> When MSI is used the VIRTIO ISR register does not need to be read by the
> guest interrupt handler.
>
>> Finally I'm curious if this is just a problem avoided by the s390
>> channel approach? Does the use of messages over a channel just avoid the
>> sort of bouncing back and forth that other hypervisors have to do when
>> emulating a device?
>
> What does "bouncing back and forth" mean exactly?

Context switching between guest and hypervisor.

--
Alex Bennée
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 13:29 ` Alex Bennée
@ 2020-07-15 15:47 ` Stefan Hajnoczi
  2020-07-15 16:40 ` Alex Bennée
  2020-07-15 17:01 ` Cornelia Huck
  0 siblings, 2 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 15:47 UTC (permalink / raw)
To: Alex Bennée; +Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng, cohuck

On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >> Finally I'm curious if this is just a problem avoided by the s390
> >> channel approach? Does the use of messages over a channel just avoid the
> >> sort of bouncing back and forth that other hypervisors have to do when
> >> emulating a device?
> >
> > What does "bouncing back and forth" mean exactly?
>
> Context switching between guest and hypervisor.

I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
request on s390 channel I/O.

Stefan
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 15:47 ` Stefan Hajnoczi
@ 2020-07-15 16:40 ` Alex Bennée
  2020-07-15 17:09 ` Cornelia Huck
  2020-07-16 10:00 ` Stefan Hajnoczi
  1 sibling, 2 replies; 16+ messages in thread
From: Alex Bennée @ 2020-07-15 16:40 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng, cohuck, Jan Kiszka

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>> >> Finally I'm curious if this is just a problem avoided by the s390
>> >> channel approach? Does the use of messages over a channel just avoid the
>> >> sort of bouncing back and forth that other hypervisors have to do when
>> >> emulating a device?
>> >
>> > What does "bouncing back and forth" mean exactly?
>>
>> Context switching between guest and hypervisor.
>
> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> request on s390 channel I/O.

Thanks.

I was also wondering about the efficiency of doorbells/notifications the
other way. AFAIUI for both PCI and MMIO only a single write is required
to the notify flag, which causes a trap to the hypervisor and the rest
of the processing. The hypervisor doesn't pay the cost of multiple exits
to read the guest state, although it obviously wants to be as efficient
as possible passing the data back up to whatever is handling the backend
of the device so it doesn't need to do multiple context switches.

Has there been any investigation into other mechanisms for notifying the
hypervisor of an event - for example using a HYP call or similar
mechanism?

My gut tells me this probably doesn't make any difference, as a trap to
the hypervisor is likely to cost the same either way because you still
need to save the guest context before actioning something, but it would
be interesting to know if anyone has looked at it. Perhaps there is a
benefit in partitioned systems where the core running the guest can
return straight away after initiating what it needs to internally in the
hypervisor to pass the notification to something that can deal with it?

--
Alex Bennée
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 16:40 ` Alex Bennée
@ 2020-07-15 17:09 ` Cornelia Huck
  2020-07-16 10:00 ` Stefan Hajnoczi
  1 sibling, 0 replies; 16+ messages in thread
From: Cornelia Huck @ 2020-07-15 17:09 UTC (permalink / raw)
To: Alex Bennée
Cc: Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng,
    Jan Kiszka

On Wed, 15 Jul 2020 17:40:33 +0100
Alex Bennée <alex.bennee@linaro.org> wrote:

> I was also wondering about the efficiency of doorbells/notifications the
> other way. AFAIUI for both PCI and MMIO only a single write is required
> to the notify flag, which causes a trap to the hypervisor and the rest
> of the processing. The hypervisor doesn't pay the cost of multiple exits
> to read the guest state, although it obviously wants to be as efficient
> as possible passing the data back up to whatever is handling the backend
> of the device so it doesn't need to do multiple context switches.
>
> Has there been any investigation into other mechanisms for notifying the
> hypervisor of an event - for example using a HYP call or similar
> mechanism?

For ccw devices, we do a 'diagnose' call (which is basically a
hypercall). It has some parameters (including the queue the guest is
notifying for), so the host already has that information when the exit
happens.

> My gut tells me this probably doesn't make any difference as a trap to
> the hypervisor is likely to cost the same either way because you still
> need to save the guest context before actioning something but it would
> be interesting to know if anyone has looked at it. Perhaps there is a
> benefit in partitioned systems where the core running the guest can
> return straight away after initiating what it needs to internally in the
> hypervisor to pass the notification to something that can deal with it?

I guess that depends mostly upon whether there is further interaction
needed, or if the guest can give the host all needed info in one go.
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 16:40 ` Alex Bennée
  2020-07-15 17:09 ` Cornelia Huck
@ 2020-07-16 10:00 ` Stefan Hajnoczi
  2020-07-16 11:25 ` Christophe de Dinechin
  1 sibling, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-16 10:00 UTC (permalink / raw)
To: Alex Bennée
Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng, cohuck, Jan Kiszka,
    Michael S. Tsirkin

On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
>
> > On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >> > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >> >> Finally I'm curious if this is just a problem avoided by the s390
> >> >> channel approach? Does the use of messages over a channel just avoid the
> >> >> sort of bouncing back and forth that other hypervisors have to do when
> >> >> emulating a device?
> >> >
> >> > What does "bouncing back and forth" mean exactly?
> >>
> >> Context switching between guest and hypervisor.
> >
> > I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> > request on s390 channel I/O.
>
> Thanks.
>
> I was also wondering about the efficiency of doorbells/notifications the
> other way. AFAIUI for both PCI and MMIO only a single write is required
> to the notify flag, which causes a trap to the hypervisor and the rest
> of the processing. The hypervisor doesn't pay the cost of multiple exits
> to read the guest state, although it obviously wants to be as efficient
> as possible passing the data back up to whatever is handling the backend
> of the device so it doesn't need to do multiple context switches.
>
> Has there been any investigation into other mechanisms for notifying the
> hypervisor of an event - for example using a HYP call or similar
> mechanism?
>
> My gut tells me this probably doesn't make any difference, as a trap to
> the hypervisor is likely to cost the same either way because you still
> need to save the guest context before actioning something, but it would
> be interesting to know if anyone has looked at it. Perhaps there is a
> benefit in partitioned systems where the core running the guest can
> return straight away after initiating what it needs to internally in the
> hypervisor to pass the notification to something that can deal with it?

It's very architecture-specific. This is something Michael Tsirkin
looked into in the past. He found that MMIO and PIO perform differently
on x86. VIRTIO supports both so the device can be configured optimally.
There was an old discussion from 2013 here:
https://lkml.org/lkml/2013/4/4/299

Without nested page tables MMIO was slower than PIO. But with nested
page tables it was faster.

Another option on x86 is using Model-Specific Registers (for hypercalls)
but this doesn't fit into the PCI device model.

A bigger issue than vmexit latency is device emulation thread wakeup
latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
ioeventfd, but it may be descheduled. Its physical CPU may be in a low
power state. I ran a benchmark late last year with QEMU's AioContext
adaptive polling disabled so we can measure the wakeup latency:

       CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio:
  fast mmio at gpa 0xfde03000
    IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1

                 4 microseconds ------^

(I did not manually configure physical CPU power states or use the
idle=poll host kernel parameter.)

Each virtqueue kick had 4 microseconds of latency before the device
emulation thread had a chance to process the virtqueue. This means the
maximum I/O Operations Per Second (IOPS) is capped at 250k before
virtqueue processing has even begun!

QEMU AioContext adaptive polling helps here because we skip the vmexit
entirely while the IOThread is polling the vring (for up to 32
microseconds by default).

It would be great if more people dig into this and optimize
notifications further.

Stefan
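(The 250k cap Stefan quotes is straight arithmetic on the wakeup
latency alone - ignoring all actual virtqueue and device processing
time:)

```python
# Upper bound on IOPS imposed purely by the per-kick wakeup latency
# measured in the trace above (4 microseconds per request).
latency_us = 4
max_iops = 1_000_000 // latency_us  # requests per second, upper bound
print(max_iops)  # 250000
```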
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 10:00 ` Stefan Hajnoczi
@ 2020-07-16 11:25 ` Christophe de Dinechin
  2020-07-16 14:19 ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Christophe de Dinechin @ 2020-07-16 11:25 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng,
    Cornelia Huck, Jan Kiszka, Michael S. Tsirkin

> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
>> [...]
>> I was also wondering about the efficiency of doorbells/notifications the
>> other way. AFAIUI for both PCI and MMIO only a single write is required
>> to the notify flag, which causes a trap to the hypervisor and the rest
>> of the processing. The hypervisor doesn't pay the cost of multiple exits
>> to read the guest state, although it obviously wants to be as efficient
>> as possible passing the data back up to whatever is handling the backend
>> of the device so it doesn't need to do multiple context switches.
>>
>> Has there been any investigation into other mechanisms for notifying the
>> hypervisor of an event - for example using a HYP call or similar
>> mechanism?
>>
>> My gut tells me this probably doesn't make any difference, as a trap to
>> the hypervisor is likely to cost the same either way because you still
>> need to save the guest context before actioning something, but it would
>> be interesting to know if anyone has looked at it. Perhaps there is a
>> benefit in partitioned systems where the core running the guest can
>> return straight away after initiating what it needs to internally in the
>> hypervisor to pass the notification to something that can deal with it?
>
> It's very architecture-specific. This is something Michael Tsirkin
> looked into in the past. He found that MMIO and PIO perform differently
> on x86. VIRTIO supports both so the device can be configured optimally.
> There was an old discussion from 2013 here:
> https://lkml.org/lkml/2013/4/4/299
>
> Without nested page tables MMIO was slower than PIO. But with nested
> page tables it was faster.
>
> Another option on x86 is using Model-Specific Registers (for hypercalls)
> but this doesn't fit into the PCI device model.

(Warning: What I write below is based on experience with very different
architectures, both CPU and hypervisor; your mileage may vary)

It looks to me like the discussion so far is mostly focused on a
"synchronous" model where presumably the same CPU is switching context
between guest and (host) device emulation.

However, I/O devices on real hardware are asynchronous by construction.
They do their thing while the CPU processes stuff. So at least
theoretically, there is no reason to context switch on the same CPU. You
could very well have an I/O thread on some other CPU doing its thing.
This allows doing something some of you may have heard me talk about,
called "interrupt coalescing".

As Stefan noted, this is not always a win, as it may introduce latency.
There are at least two cases where this latency really hurts:

1. When the I/O thread is in some kind of deep sleep, e.g. because it
was not active recently. Everything from cache to TLB may hit you here,
but that normally happens when there isn't much I/O activity, so this
case in practice does not hurt that much, or rather it hurts in a case
where we don't really care.

2. When the I/O thread is preempted, or not given enough cycles to do
its stuff. This happens when the system is both CPU and I/O bound, and
addressing that is mostly a scheduling issue. A CPU thread could hand
off to a specific I/O thread, reducing that case to the kind of context
switch Alex was mentioning, but I'm not sure how feasible it is to
implement that on Linux / kvm.

In such cases, you have to pay for a context switch. I'm not sure if
that context switch is markedly more expensive than a "vmexit". On at
least that alien architecture I was familiar with, there was little
difference between switching to "your" host CPU thread and switching to
"another" host I/O thread. But then the context switch was all in
software, so we had designed it that way.

So let's assume now that you run your device emulation fully in an I/O
thread, which we will assume for simplicity sits mostly in host
user-space, and your guest I/O code runs in a CPU thread, which we will
assume sits mostly in guest user/kernel space.

It is possible to share two-way doorbells / IRQ queues on some memory
page, very similar to a virtqueue. When you want to "doorbell" your
device, you simply write to that page. The device thread picks it up by
reading the same page, and posts I/O completions on the same page, with
simple memory writes.

Consider this I/O exchange buffer as having (at least) a writer and
reader index for both doorbells and virtual interrupts. In the
explanation below, I will call them "dwi", "dri", "iwi", "iri" for
doorbell / interrupt read and write index. (Note that as a key
optimization, you really don't want dwi and dri to be in the same cache
line, since different CPUs are going to read and write them)

You obviously still need to "kick" the I/O or CPU thread, and we are
talking about an IPI here since you don't know which CPU that other
thread is sitting on. But the interesting property is that you only need
to do that when dwi==dri or iwi==iri, because if not, the other side has
already been "kicked" and will keep working, i.e. incrementing dri or
iri, until it reaches back that state.

The real "interrupt coalescing" trick can happen here. In some cases,
you can decide to update your dwi or iwi without kicking, as long as you
know that you will need to kick later. That requires some heavy
cooperation from guest drivers, though, and is a second-order
optimization.

With a scheme like this, you replace a systematic context switch for
each device interrupt with a memory write and a "fire and forget" kick
IPI that only happens when the system is not already busy processing
I/Os, so that it can be eliminated when the system is most busy. With
interrupt coalescing, you can send IPIs at a rate much lower than the
actual I/O rate.

Not sure how difficult it is to adapt a scheme like this to the current
state of qemu / kvm, but I'm pretty sure it works well if you implement
it correctly ;-)

> A bigger issue than vmexit latency is device emulation thread wakeup
> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
> ioeventfd, but it may be descheduled. Its physical CPU may be in a low
> power state. I ran a benchmark late last year with QEMU's AioContext
> adaptive polling disabled so we can measure the wakeup latency:
>
>       CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio:
>  fast mmio at gpa 0xfde03000
>    IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
>
>                 4 microseconds ------^
>
> (I did not manually configure physical CPU power states or use the
> idle=poll host kernel parameter.)
>
> Each virtqueue kick had 4 microseconds of latency before the device
> emulation thread had a chance to process the virtqueue. This means the
> maximum I/O Operations Per Second (IOPS) is capped at 250k before
> virtqueue processing has even begun!

This data is what prompted me to write the above. This 4us seems really
long to me. I recall a benchmark where the technique above was reaching
at least 400k IOPS for a single VM on a medium-size system (4 CPUs (*)).
I remember the time I ran this benchmark quite well, because it was just
after VMware made a big splash about reaching 100k IOPS:
https://blogs.vmware.com/performance/2008/05/100000-io-opera.html

(*) Yes, at the time, 4 CPUs was a medium size system. Don't laugh.

> QEMU AioContext adaptive polling helps here because we skip the vmexit
> entirely while the IOThread is polling the vring (for up to 32
> microseconds by default).
>
> It would be great if more people dig into this and optimize
> notifications further.
>
> Stefan
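Christophe's dwi/dri rule can be sketched in a few lines (a toy,
single-threaded Python model with made-up names; a real implementation
would need atomic index updates and actual IPIs, and the I/O thread
would drain concurrently rather than after the burst):

```python
# Toy model of the shared-page doorbell scheme sketched above.
# A kick (IPI) is only needed when the ring was empty (dwi == dri)
# before the new doorbell: a non-empty ring means the other side has
# already been kicked and will keep draining until dwi == dri again.

class DoorbellRing:
    def __init__(self):
        self.dwi = 0    # doorbell write index (CPU thread side)
        self.dri = 0    # doorbell read index (I/O thread side)
        self.kicks = 0  # IPIs actually sent

    def doorbell(self):
        was_idle = (self.dwi == self.dri)
        self.dwi += 1        # publish the doorbell (plain memory write)
        if was_idle:
            self.kicks += 1  # other side may be asleep: send the IPI

    def drain(self):
        while self.dri != self.dwi:
            self.dri += 1    # process one doorbell

ring = DoorbellRing()
for _ in range(1000):        # a burst of 1000 doorbells...
    ring.doorbell()
ring.drain()
print(ring.kicks)  # 1: only the first doorbell needed an IPI
```

In this serialized toy the coalescing is maximal (one kick per burst);
with a concurrently draining I/O thread the real kick count falls
somewhere between 1 and the number of doorbells.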
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 11:25 ` Christophe de Dinechin
@ 2020-07-16 14:19 ` Stefan Hajnoczi
  2020-07-16 14:31 ` Christophe de Dinechin
  2020-07-16 14:34 ` Christophe de Dinechin
  0 siblings, 2 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-16 14:19 UTC (permalink / raw)
To: Christophe de Dinechin
Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng,
    Cornelia Huck, Jan Kiszka, Michael S. Tsirkin

On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
> [...]
> So let's assume now that you run your device emulation fully in an I/O
> thread, which we will assume for simplicity sits mostly in host
> user-space, and your guest I/O code runs in a CPU thread, which we will
> assume sits mostly in guest user/kernel space.
>
> It is possible to share two-way doorbells / IRQ queues on some memory
> page, very similar to a virtqueue. When you want to "doorbell" your
> device, you simply write to that page. The device thread picks it up by
> reading the same page, and posts I/O completions on the same page, with
> simple memory writes.
>
> [...]
>
> You obviously still need to "kick" the I/O or CPU thread, and we are
> talking about an IPI here since you don't know which CPU that other
> thread is sitting on. But the interesting property is that you only need
> to do that when dwi==dri or iwi==iri, because if not, the other side has
> already been "kicked" and will keep working, i.e. incrementing dri or
> iri, until it reaches back that state.
>
> [...]
>
> Not sure how difficult it is to adapt a scheme like this to the current
> state of qemu / kvm, but I'm pretty sure it works well if you implement
> it correctly ;-)

Hi Christophe,
QEMU/KVM does something similar to what you described. In the perf
output above the vmexit kvm_fast_mmio event occurs on physical CPU
"[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0
resumes guest execution immediately after marking the ioeventfd ready.
There is no context switch to the IOThread or a return from
ioctl(KVM_RUN) on CPU#0.

The IOThread reads the eventfd. An eventfd is a counter that is reset to
0 on read. Because it's a counter you get coalescing: if the guest
performs multiple MMIO writes the ioeventfd counter increases but the
IOThread only wakes up once and reads the ioeventfd.

VIRTIO itself also has a mechanism for suppressing notifications called
EVENT_IDX. It allows the driver to let the device know that it does not
require interrupts, and the device to let the driver know it does not
require virtqueue kicks. This reminds me a bit of the mitigation
mechanism you described.

Stefan
* Re: [virtio-dev] On doorbells (queue notifications) 2020-07-16 14:19 ` Stefan Hajnoczi @ 2020-07-16 14:31 ` Christophe de Dinechin 2020-07-16 14:34 ` Christophe de Dinechin 1 sibling, 0 replies; 16+ messages in thread From: Christophe de Dinechin @ 2020-07-16 14:31 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng, Cornelia Huck, Jan Kiszka, Michael S. Tsirkin [-- Attachment #1: Type: text/plain, Size: 9301 bytes --] > On 16 Jul 2020, at 16:19, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote: >> >> >>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote: >>> >>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote: >>>> >>>> Stefan Hajnoczi <stefanha@redhat.com> writes: >>>> >>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote: >>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes: >>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote: >>>>>>>> Finally I'm curious if this is just a problem avoided by the s390 >>>>>>>> channel approach? Does the use of messages over a channel just avoid the >>>>>>>> sort of bouncing back and forth that other hypervisors have to do when >>>>>>>> emulating a device? >>>>>>> >>>>>>> What does "bouncing back and forth" mean exactly? >>>>>> >>>>>> Context switching between guest and hypervisor. >>>>> >>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O >>>>> request on s390 channel I/O. >>>> >>>> Thanks. >>>> >>>> I was also wondering about the efficiency of doorbells/notifications the >>>> other way. AFAIUI for both PCI and MMIO only a single write is required >>>> to the notify flag which causes a trap to the hypervisor and the rest of >>>> the processing. 
The hypervisor doesn't have the cost of multiple exits to >>>> read the guest state although it obviously wants to be as efficient as >>>> possible passing the data back up to whatever is handling the backend >>>> of the device so it doesn't need to do multiple context switches. >>>> >>>> Has there been any investigation into other mechanisms for notifying the >>>> hypervisor of an event - for example using a HYP call or similar >>>> mechanism? >>>> >>>> My gut tells me this probably doesn't make any difference as a trap to >>>> the hypervisor is likely to cost the same either way because you still >>>> need to save the guest context before actioning something but it would >>>> be interesting to know if anyone has looked at it. Perhaps there is a >>>> benefit in partitioned systems where the core running the guest can return >>>> straight away after initiating what it needs to internally in the >>>> hypervisor to pass the notification to something that can deal with it? >>> >>> It's very architecture-specific. This is something Michael Tsirkin >>> looked into in the past. He found that MMIO and PIO perform differently on >>> x86. VIRTIO supports both so the device can be configured optimally. >>> There was an old discussion from 2013 here: >>> https://lkml.org/lkml/2013/4/4/299 >>> >>> Without nested page tables MMIO was slower than PIO. But with nested >>> page tables it was faster. >>> >>> Another option on x86 is using Model-Specific Registers (for hypercalls) >>> but this doesn't fit into the PCI device model. >> >> (Warning: What I write below is based on experience with very different >> architectures, both CPU and hypervisor; your mileage may vary) >> >> It looks to me like the discussion so far is mostly focused on a "synchronous" >> model where presumably the same CPU is switching context between >> guest and (host) device emulation.
>> >> However, I/O devices on real hardware are asynchronous by construction. >> They do their thing while the CPU processes stuff. So at least theoretically, >> there is no reason to context switch on the same CPU. You could very well >> have an I/O thread on some other CPU doing its thing. This allows to >> do something some of you may have heard me talk about, called >> "interrupt coalescing". >> >> As Stefan noted, this is not always a win, as it may introduce latency. >> There are at least two cases where this latency really hurts: >> >> 1. When the I/O thread is in some kind of deep sleep, e.g. because it >> was not active recently. Everything from cache to TLB may hit you here, >> but that normally happens when there isn't much I/O activity, so this case >> in practice does not hurt that much, or rather it hurts in a case where >> don't really care. >> >> 2. When the I/O thread is preempted, or not given enough cycles to do its >> stuff. This happens when the system is both CPU and I/O bound, and >> addressing that is mostly a scheduling issue. A CPU thread could hand-off >> to a specific I/O thread, reducing that case to the kind of context switch >> Alex was mentioning, but I'm not sure how feasible it is to implement >> that on Linux / kvm. >> >> In such cases, you have to pay for context switch. I'm not sure if that >> context switch is markedly more expensive than a "vmexit". On at least >> that alien architecture I was familiar with, there was little difference between >> switching to "your" host CPU thread and switching to "another" host >> I/O thread. But then the context switch was all in software, so we had >> designed it that way. >> >> So let's assume now that you run your device emulation fully in an I/O >> thread, which we will assume for simplicity sits mostly in host user-space, >> and your guest I/O code runs in a CPU thread, which we will assume >> sits mostly in guest user/kernel space. 
>> >> It is possible to share two-way doorbells / IRQ queues on some memory >> page, very similar to a virtqueue. When you want to "doorbell" your device, >> you simply write to that page. The device threads picks it up by reading >> the same page, and posts I/O completions on the same page, with simple >> memory writes. >> >> Consider this I/O exchange buffer as having (at least) a writer and reader >> index for both doorbells and virtual interrupts. In the explanation >> below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read >> and write index. (Note that as a key optimization, you really >> don't want dwi and dri to be in the same cache line, since different >> CPUs are going to read and write them) >> >> You obviously still need to "kick" the I/O or CPU thread, and we are >> talking about an IPI here since you don't know which CPU that other >> thread is sitting on. But the interesting property is that you only need >> to do that when dwi==dri or iwi==iri, because if not, the other side >> has already been "kicked" and will keep working, i.e. incrementing >> dri or iri, until it reaches back that state. >> >> The real "interrupt coalescing" trick can happen here. In some >> cases, you can decide to update your dwi or iwi without kicking, >> as long as you know that you will need to kick later. That requires >> some heavy cooperation from guest drivers, though, and is a >> second-order optimization. >> >> With a scheme like this, you replace a systematic context switch >> for each device interrupt with a memory write and a "fire and forget" >> kick IPI that only happens when the system is not already busy >> processing I/Os, so that it can be eliminated when the system is >> most busy. With interrupt coalescing, you can send IPIs at a rate >> much lower than the actual I/O rate. 
>> >> Not sure how difficult it is to adapt a scheme like this to the current >> state of qemu / kvm, but I'm pretty sure it works well if you implement >> it correctly ;-) >> >>> >>> A bigger issue than vmexit latency is device emulation thread wakeup >>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the >>> ioeventfd but it may be descheduled. Its physical CPU may be in a low >>> power state. I ran a benchmark late last year with QEMU's AioContext >>> adaptive polling disabled so we can measure the wakeup latency: >>> >>> CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio: >>> fast mmio at gpa 0xfde03000 >>> IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1 >>> 4 microseconds ------^ > > Hi Christophe, > QEMU/KVM does something similar to what you described. In the perf > output above the vmexit kvm_fast_mmio event occurs on physical CPU > "[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0 > resumes guest execution immediately after marking the ioeventfd ready. > There is no context switch to the IOThread or a return from > ioctl(KVM_RUN) on CPU#0. Oh, that's good. But then the conclusion that the 4us delay limits us to 250kIOPS is incorrect, no? Is there anything that would prevent multiple I/O events (doorbell or interrupt) to be in flight at the same time? > > The IOThread reads the eventfd. An eventfd is a counter that is reset to > 0 on read. Because it's a counter you get coalescing: if the guest > performs multiple MMIO writes the ioeventfd counter increases but the > IOThread only wakes up once and reads the ioeventfd. > > VIRTIO itself also has a mechanism for suppressing notifications called > EVENT_IDX. It allows the driver to let the device know that it does not > require interrupts, and the device to let the driver know it does not > require virtqueue kicks. This reminds me a bit of the mitigation > mechanism you described. 
> > Stefan [-- Attachment #2: Type: text/html, Size: 25934 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [virtio-dev] On doorbells (queue notifications) 2020-07-16 14:34 ` Christophe de Dinechin @ 2020-07-17 8:42 ` Stefan Hajnoczi 0 siblings, 0 replies; 16+ messages in thread From: Stefan Hajnoczi @ 2020-07-17 8:42 UTC (permalink / raw) To: Christophe de Dinechin; +Cc: virtio-dev [-- Attachment #1: Type: text/plain, Size: 9928 bytes --] On Thu, Jul 16, 2020 at 04:34:04PM +0200, Christophe de Dinechin wrote: > > > > On 16 Jul 2020, at 16:19, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote: > >> > >> > >>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>> > >>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote: > >>>> > >>>> Stefan Hajnoczi <stefanha@redhat.com> writes: > >>>> > >>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote: > >>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes: > >>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote: > >>>>>>>> Finally I'm curious if this is just a problem avoided by the s390 > >>>>>>>> channel approach? Does the use of messages over a channel just avoid the > >>>>>>>> sort of bouncing back and forth that other hypervisors have to do when > >>>>>>>> emulating a device? > >>>>>>> > >>>>>>> What does "bouncing back and forth" mean exactly? > >>>>>> > >>>>>> Context switching between guest and hypervisor. > >>>>> > >>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O > >>>>> request on s390 channel I/O. > >>>> > >>>> Thanks. > >>>> > >>>> I was also wondering about the efficiency of doorbells/notifications the > >>>> other way. AFAIUI for both PCI and MMIO only a single write is required > >>>> to the notify flag which causes a trap to the hypervisor and the rest of > >>>> the processing. 
The hypervisor doesn't have the cost of multiple exits to > >>>> read the guest state although it obviously wants to be as efficient as > >>>> possible passing the data back up to whatever is handling the backend > >>>> of the device so it doesn't need to do multiple context switches. > >>>> > >>>> Has there been any investigation into other mechanisms for notifying the > >>>> hypervisor of an event - for example using a HYP call or similar > >>>> mechanism? > >>>> > >>>> My gut tells me this probably doesn't make any difference as a trap to > >>>> the hypervisor is likely to cost the same either way because you still > >>>> need to save the guest context before actioning something but it would > >>>> be interesting to know if anyone has looked at it. Perhaps there is a > >>>> benefit in partitioned systems where the core running the guest can return > >>>> straight away after initiating what it needs to internally in the > >>>> hypervisor to pass the notification to something that can deal with it? > >>> > >>> It's very architecture-specific. This is something Michael Tsirkin > >>> looked into in the past. He found that MMIO and PIO perform differently on > >>> x86. VIRTIO supports both so the device can be configured optimally. > >>> There was an old discussion from 2013 here: > >>> https://lkml.org/lkml/2013/4/4/299 > >>> > >>> Without nested page tables MMIO was slower than PIO. But with nested > >>> page tables it was faster. > >>> > >>> Another option on x86 is using Model-Specific Registers (for hypercalls) > >>> but this doesn't fit into the PCI device model.
> >> > >> (Warning: What I write below is based on experience with very different > >> architectures, both CPU and hypervisor; your mileage may vary) > >> > >> It looks to me like the discussion so far is mostly focused on a "synchronous" > >> model where presumably the same CPU is switching context between > >> guest and (host) device emulation. > >> > >> However, I/O devices on real hardware are asynchronous by construction. > >> They do their thing while the CPU processes stuff. So at least theoretically, > >> there is no reason to context switch on the same CPU. You could very well > >> have an I/O thread on some other CPU doing its thing. This allows to > >> do something some of you may have heard me talk about, called > >> "interrupt coalescing". > >> > >> As Stefan noted, this is not always a win, as it may introduce latency. > >> There are at least two cases where this latency really hurts: > >> > >> 1. When the I/O thread is in some kind of deep sleep, e.g. because it > >> was not active recently. Everything from cache to TLB may hit you here, > >> but that normally happens when there isn't much I/O activity, so this case > >> in practice does not hurt that much, or rather it hurts in a case where > >> don't really care. > >> > >> 2. When the I/O thread is preempted, or not given enough cycles to do its > >> stuff. This happens when the system is both CPU and I/O bound, and > >> addressing that is mostly a scheduling issue. A CPU thread could hand-off > >> to a specific I/O thread, reducing that case to the kind of context switch > >> Alex was mentioning, but I'm not sure how feasible it is to implement > >> that on Linux / kvm. > >> > >> In such cases, you have to pay for context switch. I'm not sure if that > >> context switch is markedly more expensive than a "vmexit". On at least > >> that alien architecture I was familiar with, there was little difference between > >> switching to "your" host CPU thread and switching to "another" host > >> I/O thread. 
But then the context switch was all in software, so we had > >> designed it that way. > >> > >> So let's assume now that you run your device emulation fully in an I/O > >> thread, which we will assume for simplicity sits mostly in host user-space, > >> and your guest I/O code runs in a CPU thread, which we will assume > >> sits mostly in guest user/kernel space. > >> > >> It is possible to share two-way doorbells / IRQ queues on some memory > >> page, very similar to a virtqueue. When you want to "doorbell" your device, > >> you simply write to that page. The device threads picks it up by reading > >> the same page, and posts I/O completions on the same page, with simple > >> memory writes. > >> > >> Consider this I/O exchange buffer as having (at least) a writer and reader > >> index for both doorbells and virtual interrupts. In the explanation > >> below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read > >> and write index. (Note that as a key optimization, you really > >> don't want dwi and dri to be in the same cache line, since different > >> CPUs are going to read and write them) > >> > >> You obviously still need to "kick" the I/O or CPU thread, and we are > >> talking about an IPI here since you don't know which CPU that other > >> thread is sitting on. But the interesting property is that you only need > >> to do that when dwi==dri or iwi==iri, because if not, the other side > >> has already been "kicked" and will keep working, i.e. incrementing > >> dri or iri, until it reaches back that state. > >> > >> The real "interrupt coalescing" trick can happen here. In some > >> cases, you can decide to update your dwi or iwi without kicking, > >> as long as you know that you will need to kick later. That requires > >> some heavy cooperation from guest drivers, though, and is a > >> second-order optimization. 
> >> > >> With a scheme like this, you replace a systematic context switch > >> for each device interrupt with a memory write and a "fire and forget" > >> kick IPI that only happens when the system is not already busy > >> processing I/Os, so that it can be eliminated when the system is > >> most busy. With interrupt coalescing, you can send IPIs at a rate > >> much lower than the actual I/O rate. > >> > >> Not sure how difficult it is to adapt a scheme like this to the current > >> state of qemu / kvm, but I'm pretty sure it works well if you implement > >> it correctly ;-) > >> > >>> > >>> A bigger issue than vmexit latency is device emulation thread wakeup > >>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the > >>> ioeventfd but it may be descheduled. Its physical CPU may be in a low > >>> power state. I ran a benchmark late last year with QEMU's AioContext > >>> adaptive polling disabled so we can measure the wakeup latency: > >>> > >>> CPU 0/KVM 26102 [000] 85626.737072: kvm:kvm_fast_mmio: > >>> fast mmio at gpa 0xfde03000 > >>> IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1 > >>> 4 microseconds ------^ > > > > Hi Christophe, > > QEMU/KVM does something similar to what you described. In the perf > > output above the vmexit kvm_fast_mmio event occurs on physical CPU > > "[000]". The IOThread wakes up on physical CPU "[001]". Physical CPU#0 > > resumes guest execution immediately after marking the ioeventfd ready. > > There is no context switch to the IOThread or a return from > > ioctl(KVM_RUN) on CPU#0. > > > Oh, that's good. > > But then the conclusion that the 4us delay limits us to 250kIOPS > is incorrect, no? Is there anything that would prevent multiple > I/O events (doorbell or interrupt) to be in flight at the same time? The number I posted is a worst-case latency scenario: 1. Queue depth = 1. No batching. Workloads that want to achieve maximum IOPS typically queue more than 1 request at a time. 
When multiple requests are queued both submission and completion can be batched so that only a fraction of vmexits + interrupts are required to process N requests. 2. QEMU AioContext adaptive polling disabled. Polling skips ioeventfd and instead polls on the virtqueue ring index memory that the guest updates. Polling is on by default when -device virtio-blk-pci,iothread= is used but I disabled it for this test. But if we stick with the worst-case scenario then we are really limited to 250k IOPS per virtio-blk device because a single IOThread is processing the virtqueue with a 4 microsecond wakeup latency. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [virtio-dev] On doorbells (queue notifications) 2020-07-15 15:47 ` Stefan Hajnoczi 2020-07-15 16:40 ` Alex Bennée @ 2020-07-15 17:01 ` Cornelia Huck 2020-07-15 17:25 ` Alex Bennée 1 sibling, 1 reply; 16+ messages in thread From: Cornelia Huck @ 2020-07-15 17:01 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng [-- Attachment #1: Type: text/plain, Size: 2960 bytes --] On Wed, 15 Jul 2020 16:47:32 +0100 Stefan Hajnoczi <stefanha@redhat.com> wrote: > On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote: > > Stefan Hajnoczi <stefanha@redhat.com> writes: > > > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote: > > >> Finally I'm curious if this is just a problem avoided by the s390 > > >> channel approach? Does the use of messages over a channel just avoid the > > >> sort of bouncing back and forth that other hypervisors have to do when > > >> emulating a device? > > > > > > What does "bouncing back and forth" mean exactly? > > > > Context switching between guest and hypervisor. > > I have CCed Cornelia Huck, who can explain the lifecycle of an I/O > request on s390 channel I/O. Having read through this thread, I think this is mostly about notifications? These are not using channel programs (which are only used for things like feature negotiation, or emulating reading/writing a config space, which does not really exist for channel devices.) First, I/O and interrupts are highly abstracted on s390; much of the register accesses or writes done on other architectures is just not seen on s390. 
Traditionally, I/O interrupts on s390 are tied to a subchannel; you
have a rather heavyweight process for that:

guest                              host

                                   put status into subchannel
                                   queue interrupt
open up for I/O interrupt
                                   store some data into lowcore
                                   do PSW swap
interrupt handler called
read from lowcore
call tsch for subchannel
                                   store subchannel status into
                                   control block
process control block
look at subchannel indicators
virtio queue processing

This is only used for configuration change notifications, or for very
old legacy virtio implementations.

There's an alternative mechanism not tied to a subchannel, called
'adapter interrupts'. (It is even used to implement MSI-X on s390x,
which is why only virtio-pci devices using MSI-X are supported on
s390x.) It uses two-staged indicators: a global indicator to show
whether any secondary indicator is set, and secondary indicators (which
are per virtqueue in the virtio case.)

guest                              host

                                   set queue indicator(s)
                                   set global indicator
                                   queue interrupt iff global
                                   indicator had not been set
open up for I/O interrupt
                                   store some data into lowcore
                                   do PSW swap
interrupt handler called
read from lowcore
look at indicators
virtio queue processing

This has less context switches than traditional I/O interrupts; but I
think the main benefit comes from the ability to batch notifications:
as long as the guest is still processing indicators, the host does not
need to notify again, it can just set indicators (which is why the
guest always needs to do two passes at processing.) We can already
batch per-device indicators with the classic approach, but adapter
interrupts allow to batch even across many devices.
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 17:01 ` Cornelia Huck
@ 2020-07-15 17:25 ` Alex Bennée
  2020-07-15 20:04 ` Halil Pasic
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Bennée @ 2020-07-15 17:25 UTC (permalink / raw)
To: Cornelia Huck; +Cc: Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng

Cornelia Huck <cohuck@redhat.com> writes:

> On Wed, 15 Jul 2020 16:47:32 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> [...]
>
> Having read through this thread, I think this is mostly about
> notifications?

Yes - as I understand it they are the only things that really cause a
context switch between guest/hypervisor/host.

> These are not using channel programs (which are only
> used for things like feature negotiation, or emulating reading/writing
> a config space, which does not really exist for channel devices.)
>
> First, I/O and interrupts are highly abstracted on s390; much of the
> register accesses or writes done on other architectures is just not
> seen on s390.
>
> Traditionally, I/O interrupts on s390 are tied to a subchannel; you
> have a rather heavyweight process for that:
>
> [...]
>
> This has less context switches than traditional I/O interrupts; but I
> think the main benefit comes from the ability to batch notifications:
> as long as the guest is still processing indicators, the host does not
> need to notify again, it can just set indicators (which is why the
> guest always needs to do two passes at processing.) We can already
> batch per-device indicators with the classic approach, but adapter
> interrupts allow to batch even across many devices.

Thanks for the explanation.

I'm curious why the data that's going to be read from lowcore isn't
loaded before the guest opens up (is this the same as unmasking?) for
the interrupt? Is this because the host has to set up the guest IRQ
itself?
--
Alex Bennée
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 17:25 ` Alex Bennée
@ 2020-07-15 20:04 ` Halil Pasic
  2020-07-16  9:41 ` Cornelia Huck
  0 siblings, 1 reply; 16+ messages in thread
From: Halil Pasic @ 2020-07-15 20:04 UTC (permalink / raw)
To: Alex Bennée
Cc: Cornelia Huck, Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng

On Wed, 15 Jul 2020 18:25:14 +0100
Alex Bennée <alex.bennee@linaro.org> wrote:

> Cornelia Huck <cohuck@redhat.com> writes:
>
> > Having read through this thread, I think this is mostly about
> > notifications?
>
> Yes - as I understand it they are the only things that really cause a
> context switch between guest/hypervisor/host.
>
> > These are not using channel programs (which are only
> > used for things like feature negotiation, or emulating reading/writing
> > a config space, which does not really exist for channel devices.)
> >
> > First, I/O and interrupts are highly abstracted on s390; much of the
> > register accesses or writes done on other architectures is just not
> > seen on s390.
> >
> > Traditionally, I/O interrupts on s390 are tied to a subchannel; you
> > have a rather heavyweight process for that:
> >
> > [...]
> >
> > This has less context switches than traditional I/O interrupts; but I
> > think the main benefit comes from the ability to batch notifications:
> > as long as the guest is still processing indicators, the host does not
> > need to notify again, it can just set indicators (which is why the
> > guest always needs to do two passes at processing.) We can already
> > batch per-device indicators with the classic approach, but adapter
> > interrupts allow to batch even across many devices.
>
> Thanks for the explanation.
>
> I'm curious why the data that's going to be read from lowcore isn't
> loaded before the guest opens up (is this the same as unmasking?) for

You mean stored and not loaded, or?

> the interrupt? Is this because the host has to set up the guest IRQ
> itself?

Hi Alex! IMHO Connie provided a detailed yet simplified and a little
confusing description of the process of taking an I/O interrupt on
s390, which is also called the interruption action.

A prerequisite for a CPU accepting an I/O interruption request is of
course the CPU being open for it (controls: PSW, CR6). And yes, this is
the masking/unmasking. The unmasking may or may not happen at the point
indicated in the ASCII figures by Connie; what is important is that the
CPU is unmasked at that point. Right after the interruption action the
execution resumes at the interruption handler, whose address was read
(as a part of the interruption action) from the lowcore.

In that sense, there is only one interrupt handler for I/O, as there is
only one new PSW slot in the lowcore. To figure out what sort of event
or events correspond to the interruption, this I/O interrupt handler
looks at the so-called I/O interruption code. The I/O interruption code
tells us whether this is a subchannel-associated or an adapter I/O
interruption.

If subchannel-associated, the interruption code also tells us which
subchannel is asking for attention.

If an adapter interruption, further information is found (e.g. the
interruption subclass) that may allow us (the guest) to limit the
amount of processing needed in order to figure out what events are
associated with this interruption. We may not need to scan all the
indicator bits (used by the guest).

The interruption code is in turn stored by the interruption action,
which might be executed by the hypervisor (it is executed by the
hypervisor for subchannel interrupts, and may or may not be for adapter
interrupts), and must not happen if the CPU cannot take the
interruption because it is masked.
Regarding the number of context switches: if adapter interrupts are
used and everything goes well, even host->guest queue notifications
that involve an interrupt are done without getting a VCPU out of SIE
(which roughly corresponds to a VM EXIT), thanks to the mechanism
called GISA. But that is very s390 specific.

Regards,
Halil
* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 20:04 ` Halil Pasic
@ 2020-07-16  9:41 ` Cornelia Huck
  0 siblings, 0 replies; 16+ messages in thread
From: Cornelia Huck @ 2020-07-16 9:41 UTC (permalink / raw)
To: Halil Pasic
Cc: Alex Bennée, Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng

On Wed, 15 Jul 2020 22:04:57 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 15 Jul 2020 18:25:14 +0100
> Alex Bennée <alex.bennee@linaro.org> wrote:
>
> [...]
> > >
> > > Traditionally, I/O interrupts on s390 are tied to a subchannel; you
> > > have a rather heavyweight process for that:
> > >
> > > [...]
> > >
> > > This has less context switches than traditional I/O interrupts; but I
> > > think the main benefit comes from the ability to batch notifications:
> > > as long as the guest is still processing indicators, the host does not
> > > need to notify again, it can just set indicators (which is why the
> > > guest always needs to do two passes at processing.) We can already
> > > batch per-device indicators with the classic approach, but adapter
> > > interrupts allow to batch even across many devices.
> >
> > Thanks for the explanation.
> >
> > I'm curious why the data that's going to be read from lowcore isn't
> > loaded before the guest opens up (is this the same as unmasking?) for
>
> You mean stored and not loaded, or?
>
> > the interrupt? Is this because the host has to set up the guest IRQ
> > itself?
>
> Hi Alex! IMHO Connie provided a detailed yet simplified and a little
> confusing description of the process of taking an I/O interrupt on
> s390, which is also called the interruption action.

Yeah, I tried to make this understandable by people without an s390
background. Not sure if I simplified too much while doing so :)

> A prerequisite for a CPU accepting an I/O interruption request is of
> course the CPU being open for it (controls: PSW, CR6). And yes, this
> is the masking/unmasking. The unmasking may or may not happen at the
> point indicated in the ASCII figures by Connie; what is important is
> that the CPU is unmasked at that point. Right after the interruption
> action the execution resumes at the interruption handler, whose
> address was read (as a part of the interruption action) from the
> lowcore.

Right, the unmasking at that point was only to be able to explain what
is happening when I/O interrupts open up.

> In that sense, there is only one interrupt handler for I/O, as there
> is only one new PSW slot in the lowcore. To figure out what sort of
> event or events correspond to the interruption, this I/O interrupt
> handler looks at the so-called I/O interruption code. The I/O
> interruption code tells us whether this is a subchannel-associated or
> an adapter I/O interruption.

To compare to unmasking on other platforms:

- The guest controls per vcpu whether it currently masks I/O
  interrupts in general or not. (The "interruption subclass" can give
  slightly more fine-grained control, including where interrupts for a
  specific device may go; I'm ignoring this for brevity.)
- The guest can register one I/O interrupt handler per vcpu.
  If an I/O interrupt arrives, more information is available in a
  fixed per-cpu location.
- The guest can enable/disable any subchannel (and therefore, the
  device), but while it is enabled, interrupts/status pending for it
  are always possible; they cannot be masked off.

The guest can certainly set up different handlers for different vcpus,
and it may decide to never enable I/O interrupts for a certain vcpu
(e.g. to limit interrupt processing to a single vcpu); Linux uses the
same handler everywhere, and any vcpu may be enabled for I/O
interrupts. Another thing that might not be obvious: the host can pick
*any* vcpu currently enabled for I/O interrupts, but a specific
interrupt will only be delivered once.

> If subchannel-associated, the interruption code also tells us which
> subchannel is asking for attention.
>
> If an adapter interruption, further information is found (e.g. the
> interruption subclass) that may allow us (the guest) to limit the
> amount of processing needed in order to figure out what events are
> associated with this interruption. We may not need to scan all the
> indicator bits (used by the guest).

The guest has quite a high level of control over how it wants to set
up adapter interrupts; this may vary from OS to OS. I think the
important point is that it can reduce interrupts by not asking for
them as long as it is still processing the last one -- and that this
processing may include looking at many places that might create
events, like many virtqueues that the host notifies for.

> The interruption code is in turn stored by the interruption action,
> which might be executed by the hypervisor (it is executed by the
> hypervisor for subchannel interrupts, and may or may not be for
> adapter interrupts), and must not happen if the CPU cannot take the
> interruption because it is masked.
>
> Regarding the number of context switches, if adapter interrupts are
> used, if everything goes well even host->guest queue notifications
> that involve an interrupt are done without getting a VCPU out of SIE
> (roughly corresponds to VM EXIT) thanks to the mechanism called GISA.
> But that is very s390 specific.

ISTR that there has been work on exitless interrupts for other
architectures as well. If you have any possibility to request hardware
support for this, it is probably a good idea to do so :)
end of thread, other threads:[~2020-07-17 8:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14 21:43 [virtio-dev] On doorbells (queue notifications) Alex Bennée
2020-07-15 11:48 ` Stefan Hajnoczi
2020-07-15 13:29 ` Alex Bennée
2020-07-15 15:47 ` Stefan Hajnoczi
2020-07-15 16:40 ` Alex Bennée
2020-07-15 17:09 ` Cornelia Huck
2020-07-16 10:00 ` Stefan Hajnoczi
2020-07-16 11:25 ` Christophe de Dinechin
2020-07-16 14:19 ` Stefan Hajnoczi
2020-07-16 14:31 ` Christophe de Dinechin
2020-07-16 14:34 ` Christophe de Dinechin
2020-07-17  8:42 ` Stefan Hajnoczi
2020-07-15 17:01 ` Cornelia Huck
2020-07-15 17:25 ` Alex Bennée
2020-07-15 20:04 ` Halil Pasic
2020-07-16  9:41 ` Cornelia Huck