* [virtio-dev] On doorbells (queue notifications)
@ 2020-07-14 21:43 Alex Bennée
  2020-07-15 11:48 ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Bennée @ 2020-07-14 21:43 UTC (permalink / raw)
  To: virtio-dev; +Cc: Zha Bin, Zha Bin, Jing Liu, Chao Peng


Hi,

At some point in the life cycle of a virtqueue there needs to be a
notification event to signal to the guest or host that there has been an
update to the virtio structures. Because these notifications typically
trap into the hypervisor, the number of context switches they cause is a
limiting factor in the overall performance of the system.

As I understand it, on the host side this can either be reported up via
an eventfd, or user-space can simply busy-wait, polling the virtqueue
status so it can process the queue as soon as a change becomes visible.
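
To make that concrete, the two host-side options look roughly like this
(a minimal sketch, not real vhost/vhost-user code; process_virtqueue()
and the ring field names are placeholders, memory barriers omitted):

    /* 1) block until the guest kicks us via an (io)eventfd */
    uint64_t n;
    while (read(kick_fd, &n, sizeof(n)) == sizeof(n))
            process_virtqueue(vq);

    /* 2) busy-wait: poll the ring state directly, no kick needed */
    for (;;) {
            if (vq->avail_idx != vq->last_avail_idx)
                    process_virtqueue(vq);
    }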

On the guest side, however, a busy-wait approach is undesirable as it
prevents the hypervisor from properly sharing resources. The same is
true if you want to service the backend of a particular virtio device
in another, separate VM.

For Virtio PCI you get an IRQ number from a (hopefully virtualised) GIC
and eventually end up in the IRQ handler, which does an
ioread8(vp_dev->isr). This implicitly clears the current IRQ event. The
value of the ISR tells the driver what event has occurred (a config
change or a queue update).
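
From memory the handler boils down to something like the following
(simplified from the Linux virtio-pci driver, so treat the details as
approximate):

    static irqreturn_t vp_interrupt(int irq, void *opaque)
    {
            struct virtio_pci_device *vp_dev = opaque;
            /* reading the ISR also clears it, so one access does both */
            u8 isr = ioread8(vp_dev->isr);

            if (!isr)
                    return IRQ_NONE;     /* not ours (shared line) */
            if (isr & VIRTIO_PCI_ISR_CONFIG)
                    vp_config_changed(irq, opaque);
            /* otherwise walk the vqs and run their callbacks */
            return vp_vring_interrupt(irq, opaque);
    }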

I'm slightly confused by the MSI terminology because this only seems to
be relevant for the PCI legacy interface and AFAICT only touches the
outgoing path in setup and del_vq. Do incoming MSI interrupts just get
mapped directly to the appropriate handler function to process the
queues or the config?

For MMIO based transports there is a more traditional setup of an
InterruptStatus register which is read and then acknowledged with a
write to an InterruptACK register. If the device memory is handled with
trap and emulate this means two round trips to the hypervisor just to
process one vq update.
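
i.e. roughly this (paraphrasing drivers/virtio/virtio_mmio.c from
memory; for_each_mmio_vq() stands in for the real list walk):

    static irqreturn_t vm_interrupt(int irq, void *opaque)
    {
            struct virtio_mmio_device *vm_dev = opaque;
            irqreturn_t ret = IRQ_NONE;
            unsigned long status;

            /* trap #1: find out what happened */
            status = readl(vm_dev->base + VIRTIO_MMIO_INTERRUPT_STATUS);
            /* trap #2: acknowledge it */
            writel(status, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);

            if (status & VIRTIO_MMIO_INT_CONFIG) {
                    virtio_config_changed(&vm_dev->vdev);
                    ret = IRQ_HANDLED;
            }
            if (status & VIRTIO_MMIO_INT_VRING)
                    for_each_mmio_vq(vm_dev, vq)
                            ret |= vring_interrupt(irq, vq);
            return ret;
    }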

I see there was a proposal to introduce the concept of MSI based IRQs to
virtio-mmio last year:

  https://lore.kernel.org/patchwork/patch/1171065/

with proposed kernel changes in January:

  https://patchwork.kernel.org/cover/11343097/

I haven't seen any follow up since those series were posted. Is this a
proposal that has support? Is there another version likely to be
proposed to the virtio-comment list?

Finally I'm curious if this is just a problem avoided by the s390
channel approach? Does the use of messages over a channel just avoid the
sort of bouncing back and forth that other hypervisors have to do when
emulating a device?

-- 
Alex Bennée


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-14 21:43 [virtio-dev] On doorbells (queue notifications) Alex Bennée
@ 2020-07-15 11:48 ` Stefan Hajnoczi
  2020-07-15 13:29   ` Alex Bennée
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 11:48 UTC (permalink / raw)
  To: Alex Bennée; +Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng


On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> I'm slightly confused by the MSI terminology because this only seems to
> be relevant for the PCI legacy interface and AFAICT only touch the
> outgoing path in setup and del_vq. Do incoming MSI interrupts just get
> mapped directly to the appropriate handler function to process the
> queues or the config?

When MSI is used the VIRTIO ISR register does not need to be read by the
guest interrupt handler.
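
With per-virtqueue MSI-X vectors the driver wires each vector straight
to the vring handler, roughly like this (going from memory of
drivers/virtio/virtio_pci_common.c, so take it as a sketch):

    /* one MSI-X vector per virtqueue: no shared ISR to read and demux */
    err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
                      vring_interrupt, 0, "virtio-vq", vq);

The interrupt lands directly in the queue's callback with no extra
register read.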

> Finally I'm curious if this is just a problem avoided by the s390
> channel approach? Does the use of messages over a channel just avoid the
> sort of bouncing back and forth that other hypervisors have to do when
> emulating a device?

What does "bouncing back and forth" mean exactly?

Stefan


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 11:48 ` Stefan Hajnoczi
@ 2020-07-15 13:29   ` Alex Bennée
  2020-07-15 15:47     ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Bennée @ 2020-07-15 13:29 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng


Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>> I'm slightly confused by the MSI terminology because this only seems to
>> be relevant for the PCI legacy interface and AFAICT only touch the
>> outgoing path in setup and del_vq. Do incoming MSI interrupts just get
>> mapped directly to the appropriate handler function to process the
>> queues or the config?
>
> When MSI is used the VIRTIO ISR register does not need to be read by the
> guest interrupt handler.
>
>> Finally I'm curious if this is just a problem avoided by the s390
>> channel approach? Does the use of messages over a channel just avoid the
>> sort of bouncing back and forth that other hypervisors have to do when
>> emulating a device?
>
> What does "bouncing back and forth" mean exactly?

Context switching between guest and hypervisor.

>
> Stefan


-- 
Alex Bennée


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 13:29   ` Alex Bennée
@ 2020-07-15 15:47     ` Stefan Hajnoczi
  2020-07-15 16:40       ` Alex Bennée
  2020-07-15 17:01       ` Cornelia Huck
  0 siblings, 2 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 15:47 UTC (permalink / raw)
  To: Alex Bennée; +Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng, cohuck


On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >> Finally I'm curious if this is just a problem avoided by the s390
> >> channel approach? Does the use of messages over a channel just avoid the
> >> sort of bouncing back and forth that other hypervisors have to do when
> >> emulating a device?
> >
> > What does "bouncing back and forth" mean exactly?
> 
> Context switching between guest and hypervisor.

I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
request on s390 channel I/O.

Stefan


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 15:47     ` Stefan Hajnoczi
@ 2020-07-15 16:40       ` Alex Bennée
  2020-07-15 17:09         ` Cornelia Huck
  2020-07-16 10:00         ` Stefan Hajnoczi
  2020-07-15 17:01       ` Cornelia Huck
  1 sibling, 2 replies; 16+ messages in thread
From: Alex Bennée @ 2020-07-15 16:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng, cohuck, Jan Kiszka


Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>> >> Finally I'm curious if this is just a problem avoided by the s390
>> >> channel approach? Does the use of messages over a channel just avoid the
>> >> sort of bouncing back and forth that other hypervisors have to do when
>> >> emulating a device?
>> >
>> > What does "bouncing back and forth" mean exactly?
>> 
>> Context switching between guest and hypervisor.
>
> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> request on s390 channel I/O.

Thanks.

I was also wondering about the efficiency of doorbells/notifications in
the other direction. AFAIUI for both PCI and MMIO only a single write to
the notify register is required, which causes a trap to the hypervisor
and the rest of the processing. The hypervisor doesn't pay the cost of
multiple exits to read the guest state, although it obviously wants to
be as efficient as possible in passing the data back up to whatever is
handling the backend of the device so it doesn't need to do multiple
context switches.

Has there been any investigation into other mechanisms for notifying the
hypervisor of an event - for example using a HYP call or similar
mechanism?

My gut tells me this probably doesn't make any difference, as a trap to
the hypervisor is likely to cost the same either way because you still
need to save the guest context before actioning something, but it would
be interesting to know if anyone has looked at it. Perhaps there is a
benefit in partitioned systems where the core running the guest can
return straight away after initiating what it needs to internally in the
hypervisor to pass the notification to something that can deal with it?

-- 
Alex Bennée


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 15:47     ` Stefan Hajnoczi
  2020-07-15 16:40       ` Alex Bennée
@ 2020-07-15 17:01       ` Cornelia Huck
  2020-07-15 17:25         ` Alex Bennée
  1 sibling, 1 reply; 16+ messages in thread
From: Cornelia Huck @ 2020-07-15 17:01 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng


On Wed, 15 Jul 2020 16:47:32 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> > Stefan Hajnoczi <stefanha@redhat.com> writes:  
> > > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:  
> > >> Finally I'm curious if this is just a problem avoided by the s390
> > >> channel approach? Does the use of messages over a channel just avoid the
> > >> sort of bouncing back and forth that other hypervisors have to do when
> > >> emulating a device?  
> > >
> > > What does "bouncing back and forth" mean exactly?  
> > 
> > Context switching between guest and hypervisor.  
> 
> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> request on s390 channel I/O.

Having read through this thread, I think this is mostly about
notifications? These are not using channel programs (which are only
used for things like feature negotiation, or emulating reading/writing
a config space, which does not really exist for channel devices.)

First, I/O and interrupts are highly abstracted on s390; many of the
register accesses or writes done on other architectures are simply not
seen on s390.

Traditionally, I/O interrupts on s390 are tied to a subchannel; you
have a rather heavyweight process for that:

guest								host

					put status into subchannel
					queue interrupt
open up for I/O interrupt
					store some data into lowcore
					do PSW swap
interrupt handler called
read from lowcore
call tsch for subchannel
					store subchannel status into
					control block
process control block
look at subchannel indicators
virtio queue processing

This is only used for configuration change notifications, or for very
old legacy virtio implementations.

There's an alternative mechanism not tied to a subchannel, called
'adapter interrupts'. (It is even used to implement MSI-X on s390x,
which is why only virtio-pci devices using MSI-X are supported on
s390x.) It uses two-staged indicators: a global indicator to show
whether any secondary indicator is set, and secondary indicators (which
are per virtqueue in the virtio case.)

guest								host

					set queue indicator(s)
					set global indicator
					queue interrupt iff global
					indicator had not been set
open up for I/O interrupt
					store some data into lowcore
					do PSW swap
interrupt handler called
read from lowcore
look at indicators
virtio queue processing

This has fewer context switches than traditional I/O interrupts, but I
think the main benefit comes from the ability to batch notifications:
as long as the guest is still processing indicators, the host does not
need to notify again, it can just set indicators (which is why the
guest always needs to do two passes at processing). We can already
batch per-device indicators with the classic approach, but adapter
interrupts allow batching even across many devices.
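
In very rough pseudo-C (not the actual virtio_ccw code; vq_for_bit()
and process_virtqueue() are made up), the guest side of adapter
interrupt processing looks like:

    /* first stage: did anything happen at all? */
    while (xchg(&global_indicator, 0)) {
            /* second stage: per-virtqueue bits, possibly many devices */
            for_each_set_bit(bit, queue_indicators, nr_indicator_bits) {
                    clear_bit(bit, queue_indicators);
                    process_virtqueue(vq_for_bit(bit));
            }
            /* go around again: the host may have set more bits
             * without sending another interrupt */
    }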


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 16:40       ` Alex Bennée
@ 2020-07-15 17:09         ` Cornelia Huck
  2020-07-16 10:00         ` Stefan Hajnoczi
  1 sibling, 0 replies; 16+ messages in thread
From: Cornelia Huck @ 2020-07-15 17:09 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng, Jan Kiszka

On Wed, 15 Jul 2020 17:40:33 +0100
Alex Bennée <alex.bennee@linaro.org> wrote:

> I was also wondering about the efficiency of doorbells/notifications the
> other way. AFAIUI for both PCI and MMIO only a single write is required
> to the notify flag which causes a trap to the hypervisor and the rest of
> the processing. The hypervisor doesn't have the cost multiple exits to
> read the guest state although it obviously wants to be as efficient as
> possible passing the data back up to what ever is handling the backend
> of the device so it doesn't need to do multiple context switches.
> 
> Has there been any investigation into other mechanisms for notifying the
> hypervisor of an event - for example using a HYP call or similar
> mechanism?

For ccw devices, we do a 'diagnose' call (which is basically a
hypercall). It has some parameters (including the queue the guest is
notifying for), so the host already has that information when the exit
happens.
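
Sketching it (heavily simplified; diag500() is a made-up wrapper, the
real code in drivers/s390/virtio/virtio_ccw.c is a small piece of
inline assembly issuing DIAGNOSE 0x500 and also passes a cookie back
and forth):

    /* guest -> host doorbell for virtio-ccw, from memory */
    static void virtio_ccw_notify(struct subchannel_id schid,
                                  unsigned int vq_index)
    {
            /* function code, subchannel id and queue index all go to
             * the host in one exit */
            diag500(KVM_S390_VIRTIO_CCW_NOTIFY, schid, vq_index);
    }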

> My gut tells me this probably doesn't make any difference as a trap to
> the hypervisor is likely to cost the same either way because you still
> need to save the guest context before actioning something but it would
> be interesting to know if anyone has looked at it. Perhaps there is a
> benefit in partitioned systems where core running the guest can return
> straight away after initiating what it needs to internally in the
> hypervisor to pass the notification to something that can deal with it?

I guess that depends mostly upon whether there is further interaction
needed, or if the guest can give the host all needed info in one go.



* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 17:01       ` Cornelia Huck
@ 2020-07-15 17:25         ` Alex Bennée
  2020-07-15 20:04           ` Halil Pasic
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Bennée @ 2020-07-15 17:25 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng


Cornelia Huck <cohuck@redhat.com> writes:

> On Wed, 15 Jul 2020 16:47:32 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>> > Stefan Hajnoczi <stefanha@redhat.com> writes:  
>> > > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:  
>> > >> Finally I'm curious if this is just a problem avoided by the s390
>> > >> channel approach? Does the use of messages over a channel just avoid the
>> > >> sort of bouncing back and forth that other hypervisors have to do when
>> > >> emulating a device?  
>> > >
>> > > What does "bouncing back and forth" mean exactly?  
>> > 
>> > Context switching between guest and hypervisor.  
>> 
>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
>> request on s390 channel I/O.
>
> Having read through this thread, I think this is mostly about
> notifications?

Yes - as I understand it they are the only things that really cause a
context switch between guest/hypervisor/host.

> These are not using channel programs (which are only
> used for things like feature negotiation, or emulating reading/writing
> a config space, which does not really exist for channel devices.)
>
> First, I/O and interrupts are highly abstracted on s390; much of the
> register accesses or writes done on other architectures is just not
> seen on s390.
>
> Traditionally, I/O interrupts on s390 are tied to a subchannel; you
> have a rather heavyweight process for that:
>
> guest								host
>
> 					put status into subchannel
> 					queue interrupt
> open up for I/O interrupt
> 					store some data into lowcore
> 					do PSW swap
> interrupt handler called
> read from lowcore
> call tsch for subchannel
> 					store subchannel status into
> 					control block
> process control block
> look at subchannel indicators
> virtio queue processing
>
> This is only used for configuration change notifications, or for very
> old legacy virtio implementations.
>
> There's an alternative mechanism not tied to a subchannel, called
> 'adapter interrupts'. (It is even used to implement MSI-X on s390x,
> which is why only virtio-pci devices using MSI-X are supported on
> s390x.) It uses two-staged indicators: a global indicator to show
> whether any secondary indicator is set, and secondary indicators (which
> are per virtqueue in the virtio case.)
>
> guest								host
>
> 					set queue indicator(s)
> 					set global indicator
> 					queue interrupt iff global
> 					indicator had not been set
> open up for I/O interrupt
> 					store some data into lowcore
> 					do PSW swap
> interrupt handler called
> read from lowcore
> look at indicators
> virtio queue processing
>
> This has less context switches than traditional I/O interrupts; but I
> think the main benefit comes from the ability to batch notifications:
> as long as the guest is still processing indicators, the host does not
> need to notify again, it can just set indicators (which is why the
> guest always needs to do two passes at processing.) We can already
> batch per-device indicators with the classic approach, but adapter
> interrupts allow to batch even across many devices.

Thanks for the explanation.

I'm curious why the data that's going to be read from lowcore isn't
loaded before the guest opens up (is this the same as unmasking?) for
the interrupt? Is this because the host has to set up the guest IRQ
itself?

-- 
Alex Bennée


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 17:25         ` Alex Bennée
@ 2020-07-15 20:04           ` Halil Pasic
  2020-07-16  9:41             ` Cornelia Huck
  0 siblings, 1 reply; 16+ messages in thread
From: Halil Pasic @ 2020-07-15 20:04 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Cornelia Huck, Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu, Chao Peng

On Wed, 15 Jul 2020 18:25:14 +0100
Alex Bennée <alex.bennee@linaro.org> wrote:

> 
> Cornelia Huck <cohuck@redhat.com> writes:
> 
> > On Wed, 15 Jul 2020 16:47:32 +0100
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> >> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >> > Stefan Hajnoczi <stefanha@redhat.com> writes:  
> >> > > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:  
> >> > >> Finally I'm curious if this is just a problem avoided by the s390
> >> > >> channel approach? Does the use of messages over a channel just avoid the
> >> > >> sort of bouncing back and forth that other hypervisors have to do when
> >> > >> emulating a device?  
> >> > >
> >> > > What does "bouncing back and forth" mean exactly?  
> >> > 
> >> > Context switching between guest and hypervisor.  
> >> 
> >> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> >> request on s390 channel I/O.
> >
> > Having read through this thread, I think this is mostly about
> > notifications?
> 
> Yes - as I understand it they are the only things that really cause a
> context switch between guest/hypervisor/host.
> 
> > These are not using channel programs (which are only
> > used for things like feature negotiation, or emulating reading/writing
> > a config space, which does not really exist for channel devices.)
> >
> > First, I/O and interrupts are highly abstracted on s390; much of the
> > register accesses or writes done on other architectures is just not
> > seen on s390.
> >
> > Traditionally, I/O interrupts on s390 are tied to a subchannel; you
> > have a rather heavyweight process for that:
> >
> > guest								host
> >
> > 					put status into subchannel
> > 					queue interrupt
> > open up for I/O interrupt
> > 					store some data into lowcore
> > 					do PSW swap
> > interrupt handler called
> > read from lowcore
> > call tsch for subchannel
> > 					store subchannel status into
> > 					control block
> > process control block
> > look at subchannel indicators
> > virtio queue processing
> >
> > This is only used for configuration change notifications, or for very
> > old legacy virtio implementations.
> >
> > There's an alternative mechanism not tied to a subchannel, called
> > 'adapter interrupts'. (It is even used to implement MSI-X on s390x,
> > which is why only virtio-pci devices using MSI-X are supported on
> > s390x.) It uses two-staged indicators: a global indicator to show
> > whether any secondary indicator is set, and secondary indicators (which
> > are per virtqueue in the virtio case.)
> >
> > guest								host
> >
> > 					set queue indicator(s)
> > 					set global indicator
> > 					queue interrupt iff global
> > 					indicator had not been set
> > open up for I/O interrupt
> > 					store some data into lowcore
> > 					do PSW swap
> > interrupt handler called
> > read from lowcore
> > look at indicators
> > virtio queue processing
> >
> > This has less context switches than traditional I/O interrupts; but I
> > think the main benefit comes from the ability to batch notifications:
> > as long as the guest is still processing indicators, the host does not
> > need to notify again, it can just set indicators (which is why the
> > guest always needs to do two passes at processing.) We can already
> > batch per-device indicators with the classic approach, but adapter
> > interrupts allow to batch even across many devices.
> 
> Thanks for the explanation.
> 
> I'm curious why the data that's going to be read from lowcore isn't
> loaded before the guest opens up (is this the same as unmasking?) for

You mean stored and not loaded, or?

> the interrupt? Is this because the host has to set up the guest IRQ
> itself?
> 

Hi Alex! IMHO Connie provided a detailed yet simplified, and a little
confusing, description of the process of taking an IO interrupt on s390,
which is also called the interruption action.

A prerequisite for a CPU accepting an I/O interruption request is of
course the CPU being open for it (controls: PSW, CR6). And yes, this is
the masking/unmasking. The unmasking may or may not happen at the point
indicated in the ascii figures by Connie; what is important is that the
CPU is unmasked at that point. Right after the interruption action,
execution resumes at the interruption handler, whose address was read
(as a part of the interruption action) from the lowcore.

In that sense, there is only one interrupt handler for IO, as there
is only one new PSW slot in the lowcore. To figure out what sort of
event or events correspond to the interruption, this IO interrupt handler
looks at the so-called IO interruption code. The IO interruption code
tells us whether this is a subchannel-associated or an adapter IO
interruption.

If it is subchannel associated, the interruption code also tells us
which subchannel is asking for attention.

If it is an adapter interruption, further information is found (e.g. the
interruption subclass) that may allow us (the guest) to limit the amount
of processing needed in order to figure out which events are associated
with this interruption. We may not need to scan all the indicator bits
(used by the guest).

The interruption code is in turn stored by the interruption action,
which might be executed by the hypervisor (it is executed by the
hypervisor for subchannel interrupts, and may or may not be for adapter
interrupts), and which must not happen if the CPU cannot take the
interruption because it is masked.

Regarding the number of context switches: if adapter interrupts are used
and everything goes well, even host->guest queue notifications that
involve an interrupt are done without getting a VCPU out of SIE (which
roughly corresponds to a VM exit), thanks to the mechanism called GISA.
But that is very s390 specific.

Regards,
Halil




* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 20:04           ` Halil Pasic
@ 2020-07-16  9:41             ` Cornelia Huck
  0 siblings, 0 replies; 16+ messages in thread
From: Cornelia Huck @ 2020-07-16  9:41 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Alex Bennée, Stefan Hajnoczi, virtio-dev, Zha Bin, Jing Liu,
	Chao Peng

On Wed, 15 Jul 2020 22:04:57 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 15 Jul 2020 18:25:14 +0100
> Alex Bennée <alex.bennee@linaro.org> wrote:
> 
> > 
> > Cornelia Huck <cohuck@redhat.com> writes:
> >   
> > > On Wed, 15 Jul 2020 16:47:32 +0100
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >  
> > >> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:  
> > >> > Stefan Hajnoczi <stefanha@redhat.com> writes:    
> > >> > > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:    
> > >> > >> Finally I'm curious if this is just a problem avoided by the s390
> > >> > >> channel approach? Does the use of messages over a channel just avoid the
> > >> > >> sort of bouncing back and forth that other hypervisors have to do when
> > >> > >> emulating a device?    
> > >> > >
> > >> > > What does "bouncing back and forth" mean exactly?    
> > >> > 
> > >> > Context switching between guest and hypervisor.    
> > >> 
> > >> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> > >> request on s390 channel I/O.  
> > >
> > > Having read through this thread, I think this is mostly about
> > > notifications?  
> > 
> > Yes - as I understand it they are the only things that really cause a
> > context switch between guest/hypervisor/host.
> >   
> > > These are not using channel programs (which are only
> > > used for things like feature negotiation, or emulating reading/writing
> > > a config space, which does not really exist for channel devices.)
> > >
> > > First, I/O and interrupts are highly abstracted on s390; much of the
> > > register accesses or writes done on other architectures is just not
> > > seen on s390.
> > >
> > > Traditionally, I/O interrupts on s390 are tied to a subchannel; you
> > > have a rather heavyweight process for that:
> > >
> > > guest								host
> > >
> > > 					put status into subchannel
> > > 					queue interrupt
> > > open up for I/O interrupt
> > > 					store some data into lowcore
> > > 					do PSW swap
> > > interrupt handler called
> > > read from lowcore
> > > call tsch for subchannel
> > > 					store subchannel status into
> > > 					control block
> > > process control block
> > > look at subchannel indicators
> > > virtio queue processing
> > >
> > > This is only used for configuration change notifications, or for very
> > > old legacy virtio implementations.
> > >
> > > There's an alternative mechanism not tied to a subchannel, called
> > > 'adapter interrupts'. (It is even used to implement MSI-X on s390x,
> > > which is why only virtio-pci devices using MSI-X are supported on
> > > s390x.) It uses two-staged indicators: a global indicator to show
> > > whether any secondary indicator is set, and secondary indicators (which
> > > are per virtqueue in the virtio case.)
> > >
> > > guest								host
> > >
> > > 					set queue indicator(s)
> > > 					set global indicator
> > > 					queue interrupt iff global
> > > 					indicator had not been set
> > > open up for I/O interrupt
> > > 					store some data into lowcore
> > > 					do PSW swap
> > > interrupt handler called
> > > read from lowcore
> > > look at indicators
> > > virtio queue processing
> > >
> > > This has less context switches than traditional I/O interrupts; but I
> > > think the main benefit comes from the ability to batch notifications:
> > > as long as the guest is still processing indicators, the host does not
> > > need to notify again, it can just set indicators (which is why the
> > > guest always needs to do two passes at processing.) We can already
> > > batch per-device indicators with the classic approach, but adapter
> > > interrupts allow to batch even across many devices.  
> > 
> > Thanks for the explanation.
> > 
> > I'm curious why the data that's going to be read from lowcore isn't
> > loaded before the guest opens up (is this the same as unmasking?) for  
> 
> You mean stored and not loaded, or?
> 
> > the interrupt? Is this because the host has to set up the guest IRQ
> > itself?
> >   
> 
> Hi Alex! IMHO Connie provided a detailed jet simplified and a little
> confusing description  of the process of taking an IO interrupt on s390,
> which is also called the interruption action.

Yeah, I tried to make this understandable by people without an s390
background. Not sure if I simplified too much while doing so :)

> 
> A prerequisite for a CPU accepting an I/O interruption request is of
> course  the CPU being open for it (controls: PSW, CR6). And yes this is
> the masking/unmasking. The unmasking may or may not happen at the point
> indicated in the ascii figures by Connie, what is important the cpu is
> unmasked at that point. Right after the interruption action the
> execution resumes at the interruption handler, whose address was read
> (as a part of the interruption action) from the lowcore.

Right, the unmasking at that point was only to be able to explain what
is happening when I/O interrupts open up.

> 
> In that sense, there is only one interrupt handler for IO, as there
> is only one new PSW slot in the lowcore. To figure out what sort of
> event or events correspond to the interruption. This IO interrupt handler
> looks at the so called IO interruption code. The IO interruption code
> tells us if this is a subchannel associated, or an adapter IO
> interruption.

To compare to unmasking on other platforms:

- The guest controls per vcpu whether it currently masks I/O interrupts
  in general or not. (The "interruption subclass" can give slightly
  more fine-grained control, including where interrupts for a specific
  device may go; I'm ignoring this for brevity.)
- The guest can register one I/O interrupt handler per vcpu. If an I/O
  interrupt arrives, more information is available in a fixed per-cpu
  location.
- The guest can enable/disable any subchannel (and therefore, the
  device), but while it is enabled, interrupts/status pending for it
  are always possible; they cannot be masked off.

The guest can certainly set up different handlers for different vcpus,
and it may decide to never enable I/O interrupts for a certain vcpu
(e.g. to limit interrupt processing to a single vcpu); Linux uses the
same handler everywhere, and any vcpu may be enabled for I/O interrupts.

Another thing that might not be obvious: The host can pick *any* vcpu
currently enabled for I/O interrupts, but a specific interrupt will
only be delivered once.

> 
> If subchannel associated then, the interruption code also tells us which
> subchannel is asking for attention.
> 
> If adapter interruption, further information is found (e.g.
> interruption subclass) that may allow us (the guest) to limit the amount
> of processing needed in order to figure out what events are associated
> with this interruption. We may not need to scan all the indicator bits
> (used by the
> guest).

The guest has quite a high level of control over how it wants to set up
adapter interrupts; this may vary from OS to OS. I think the important
point is that it can reduce interrupts by not asking for them as long as
it is still processing the last one -- and that this processing may
include looking at many places that might create events, like many
virtqueues that the host notifies for.

> 
> The interruption code is in turn stored by the interruption action,
> might be executed by the hypervisor (is executed by the hypervisor for
> subchannel interrupts, and may or may not be for adapter interrupts), and
> must not happen if the cpu can not take the interruption, because it is
> masked.
> 
> Regarding the number of context switches, if adapter interrupts are used,
> if everything goes well even host->guest queue notifications that involve
> an interrupt are done without getting a VCPU out of SIE (roughly
> corresponds to VM EXIT) thanks to the mechanism
> called GISA. But that is very s390 specific.

ISTR that there has been work on exitless interrupts for other
architectures as well. If you have any possibility to request hardware
support for this, it is probably a good idea to do so :)



* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-15 16:40       ` Alex Bennée
  2020-07-15 17:09         ` Cornelia Huck
@ 2020-07-16 10:00         ` Stefan Hajnoczi
  2020-07-16 11:25           ` Christophe de Dinechin
  1 sibling, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-16 10:00 UTC (permalink / raw)
  To: Alex Bennée
  Cc: virtio-dev, Zha Bin, Jing Liu, Chao Peng, cohuck, Jan Kiszka,
	Michael S. Tsirkin


On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >> > On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >> >> Finally I'm curious if this is just a problem avoided by the s390
> >> >> channel approach? Does the use of messages over a channel just avoid the
> >> >> sort of bouncing back and forth that other hypervisors have to do when
> >> >> emulating a device?
> >> >
> >> > What does "bouncing back and forth" mean exactly?
> >> 
> >> Context switching between guest and hypervisor.
> >
> > I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> > request on s390 channel I/O.
> 
> Thanks.
> 
> I was also wondering about the efficiency of doorbells/notifications the
> other way. AFAIUI for both PCI and MMIO only a single write is required
> to the notify flag which causes a trap to the hypervisor and the rest of
> the processing. The hypervisor doesn't have the cost multiple exits to
> read the guest state although it obviously wants to be as efficient as
> possible passing the data back up to what ever is handling the backend
> of the device so it doesn't need to do multiple context switches.
> 
> Has there been any investigation into other mechanisms for notifying the
> hypervisor of an event - for example using a HYP call or similar
> mechanism?
> 
> My gut tells me this probably doesn't make any difference as a trap to
> the hypervisor is likely to cost the same either way because you still
> need to save the guest context before actioning something but it would
> be interesting to know if anyone has looked at it. Perhaps there is a
> benefit in partitioned systems where core running the guest can return
> straight away after initiating what it needs to internally in the
> hypervisor to pass the notification to something that can deal with it?

It's very architecture-specific. This is something Michael Tsirkin
looked into in the past. He found that MMIO and PIO perform differently
on x86. VIRTIO supports both so the device can be configured optimally.
There was an old discussion from 2013 here:
https://lkml.org/lkml/2013/4/4/299

Without nested page tables MMIO was slower than PIO. But with nested
page tables it was faster.

Another option on x86 is using Model-Specific Registers (for hypercalls)
but this doesn't fit into the PCI device model.

A bigger issue than vmexit latency is device emulation thread wakeup
latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
ioeventfd but it may be descheduled. Its physical CPU may be in a low
power state. I ran a benchmark late last year with QEMU's AioContext
adaptive polling disabled so we can measure the wakeup latency:

       CPU 0/KVM 26102 [000] 85626.737072:       kvm:kvm_fast_mmio:
fast mmio at gpa 0xfde03000
    IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
                   4 microseconds ------^

(I did not manually configure physical CPU power states or use the
idle=poll host kernel parameter.)

Each virtqueue kick had 4 microseconds of latency before the device
emulation thread had a chance to process the virtqueue. This means the
maximum I/O Operations Per Second (IOPS) is capped at 250k before
virtqueue processing has even begun!
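
For reference, the plumbing for that kick is roughly: the VMM registers
an eventfd against the notify address with KVM_IOEVENTFD, so the
guest's write never returns to userspace on the vCPU thread, and the
IOThread just waits on that fd. A simplified sketch (error handling
omitted, process_virtqueue() is a placeholder):

    struct kvm_ioeventfd kick = {
            .addr  = notify_gpa,   /* e.g. 0xfde03000 in the trace above */
            .len   = 0,            /* 0 = "fast mmio": match any write */
            .fd    = eventfd(0, 0),
            .flags = 0,            /* no datamatch */
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &kick);

    /* IOThread: the ~4us above is the gap before this read() returns */
    uint64_t n;
    while (read(kick.fd, &n, sizeof(n)) == sizeof(n))
            process_virtqueue(vq);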

QEMU AioContext adaptive polling helps here because we skip the vmexit
entirely while the IOThread is polling the vring (for up to 32
microseconds by default).
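
Conceptually the polling loop is just this (helper names made up; IIRC
the device also suppresses guest notifications while it is polling,
which is how the vmexit gets skipped):

    /* spin on the ring for a while before sleeping on the fd again */
    int64_t deadline = now_ns() + poll_ns;    /* ~32us by default */
    while (now_ns() < deadline) {
            if (virtqueue_ready(vq)) {        /* avail index moved? */
                    process_virtqueue(vq);    /* no kick, no wakeup */
                    return;
            }
    }
    ppoll(fds, nfds, &timeout, NULL);         /* block until the next kick */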

It would be great if more people dig into this and optimize
notifications further.

Stefan


* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 10:00         ` Stefan Hajnoczi
@ 2020-07-16 11:25           ` Christophe de Dinechin
  2020-07-16 14:19             ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Christophe de Dinechin @ 2020-07-16 11:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng,
	Cornelia Huck, Jan Kiszka, Michael S. Tsirkin




> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
>> 
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> 
>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>>>>>> Finally I'm curious if this is just a problem avoided by the s390
>>>>>> channel approach? Does the use of messages over a channel just avoid the
>>>>>> sort of bouncing back and forth that other hypervisors have to do when
>>>>>> emulating a device?
>>>>> 
>>>>> What does "bouncing back and forth" mean exactly?
>>>> 
>>>> Context switching between guest and hypervisor.
>>> 
>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
>>> request on s390 channel I/O.
>> 
>> Thanks.
>> 
>> I was also wondering about the efficiency of doorbells/notifications the
>> other way. AFAIUI for both PCI and MMIO only a single write is required
>> to the notify flag which causes a trap to the hypervisor and the rest of
>> the processing. The hypervisor doesn't have the cost multiple exits to
>> read the guest state although it obviously wants to be as efficient as
>> possible passing the data back up to what ever is handling the backend
>> of the device so it doesn't need to do multiple context switches.
>> 
>> Has there been any investigation into other mechanisms for notifying the
>> hypervisor of an event - for example using a HYP call or similar
>> mechanism?
>> 
>> My gut tells me this probably doesn't make any difference as a trap to
>> the hypervisor is likely to cost the same either way because you still
>> need to save the guest context before actioning something but it would
>> be interesting to know if anyone has looked at it. Perhaps there is a
>> benefit in partitioned systems where core running the guest can return
>> straight away after initiating what it needs to internally in the
>> hypervisor to pass the notification to something that can deal with it?
> 
> It's very architecture-specific. This is something Michael Tsirkin
> looked in in the past. He found that MMIO and PIO perform differently on
> x86. VIRTIO supports both so the device can be configured optimally.
> There was an old discussion from 2013 here:
> https://lkml.org/lkml/2013/4/4/299
> 
> Without nested page tables MMIO was slower than PIO. But with nested
> page tables it was faster.
> 
> Another option on x86 is using Model-Specific Registers (for hypercalls)
> but this doesn't fit into the PCI device model.

(Warning: What I write below is based on experience with very different
architectures, both CPU and hypervisor; your mileage may vary)

It looks to me like the discussion so far is mostly focused on a "synchronous"
model where presumably the same CPU is switching context between
guest and (host) device emulation.

However, I/O devices on real hardware are asynchronous by construction.
They do their thing while the CPU processes stuff. So at least theoretically,
there is no reason to context switch on the same CPU. You could very well
have an I/O thread on some other CPU doing its thing. This makes it
possible to do something some of you may have heard me talk about,
called "interrupt coalescing".

As Stefan noted, this is not always a win, as it may introduce latency.
There are at least two cases where this latency really hurts:

1. When the I/O thread is in some kind of deep sleep, e.g. because it
was not active recently. Everything from cache to TLB may hit you here,
but that normally happens when there isn't much I/O activity, so this case
in practice does not hurt that much, or rather it hurts in a case where we
don't really care.

2. When the I/O thread is preempted, or not given enough cycles to do its
stuff. This happens when the system is both CPU and I/O bound, and
addressing that is mostly a scheduling issue. A CPU thread could hand off
to a specific I/O thread, reducing that case to the kind of context switch
Alex was mentioning, but I'm not sure how feasible it is to implement
that on Linux / kvm.

In such cases, you have to pay for a context switch. I'm not sure if that
context switch is markedly more expensive than a "vmexit". On at least
that alien architecture I was familiar with, there was little difference between
switching to "your" host CPU thread and switching to "another" host
I/O thread. But then the context switch was all in software, so we had
designed it that way.

So let's assume now that you run your device emulation fully in an I/O
thread, which we will assume for simplicity sits mostly in host user-space,
and your guest I/O code runs in a CPU thread, which we will assume
sits mostly in guest user/kernel space.

It is possible to share two-way doorbells / IRQ queues on some memory
page, very similar to a virtqueue. When you want to "doorbell" your device,
you simply write to that page. The device threads picks it up by reading
the same page, and posts I/O completions on the same page, with simple
memory writes.

Consider this I/O exchange buffer as having (at least) a writer and reader
index for both doorbells and virtual interrupts. In the explanation
below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read
and write index. (Note that as a key optimization, you really
don't want dwi and dri to be in the same cache line, since different
CPUs are going to read and write them)

You obviously still need to "kick" the I/O or CPU thread, and we are
talking about an IPI here since you don't know which CPU that other
thread is sitting on. But the interesting property is that you only need
to do that when dwi==dri or iwi==iri, because if not, the other side
has already been "kicked" and will keep working, i.e. incrementing
dri or iri, until it reaches back that state.
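
To make the index trick concrete, here is the doorbell half in
pseudo-C (the ring layout, handle_doorbell() and kick_ipi() are
invented for the example; memory barriers and the race around the
"idle" check are glossed over):

    /* CPU thread: post a doorbell */
    ring->db[ring->dwi % RING_SIZE] = req;
    bool was_idle = (ring->dwi == ring->dri);  /* other side caught up? */
    ring->dwi++;
    if (was_idle)
            kick_ipi(io_thread);       /* otherwise it is already awake */

    /* I/O thread: keep draining until the indices meet again */
    while (ring->dri != ring->dwi) {
            handle_doorbell(ring->db[ring->dri % RING_SIZE]);
            ring->dri++;
    }

The interrupt direction (iwi/iri) is symmetrical.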

The real "interrupt coalescing" trick can happen here. In some
cases, you can decide to update your dwi or iwi without kicking,
as long as you know that you will need to kick later. That requires
some heavy cooperation from guest drivers, though, and is a
second-order optimization.

With a scheme like this, you replace a systematic context switch
for each device interrupt with a memory write and a "fire and forget"
kick IPI that only happens when the system is not already busy
processing I/Os, so that it can be eliminated when the system is
most busy. With interrupt coalescing, you can send IPIs at a rate
much lower than the actual I/O rate.

Not sure how difficult it is to adapt a scheme like this to the current
state of qemu / kvm, but I'm pretty sure it works well if you implement
it correctly ;-)

> 
> A bigger issue than vmexit latency is device emulation thread wakeup
> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
> ioeventfd but it may be descheduled. Its physical CPU may be in a low
> power state. I ran a benchmark late last year with QEMU's AioContext
> adaptive polling disabled so we can measure the wakeup latency:
> 
>       CPU 0/KVM 26102 [000] 85626.737072:       kvm:kvm_fast_mmio:
> fast mmio at gpa 0xfde03000
>    IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
>                   4 microseconds ------^
> 
> (I did not manually configure physical CPU power states or use the
> idle=poll host kernel parameter.)
> 
> Each virtqueue kick had 4 microseconds of latency before the device
> emulation thread had a chance to process the virtqueue. This means the
> maximum I/O Operations Per Second (IOPS) is capped at 250k before
> virtqueue processing has even begun!

This data is what prompted me to write the above. This 4us seems
really long to me.

I recall a benchmark where the technique above was reaching at least
400k IOPs for a single VM on a medium-size system (4 CPUs (*)).
I remember the time I ran this benchmark quite well, because it was just
after VMware made a big splash about reaching 100k IOPs:
https://blogs.vmware.com/performance/2008/05/100000-io-opera.html.

(*) Yes, at the time, 4 CPUs was a medium size system. Don't laugh.

> 
> QEMU AioContext adaptive polling helps here because we skip the vmexit
> entirely while the IOThread is polling the vring (for up to 32
> microseconds by default).
> 
> It would be great if more people dig into this and optimize
> notifications further.
> 
> Stefan



* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 11:25           ` Christophe de Dinechin
@ 2020-07-16 14:19             ` Stefan Hajnoczi
  2020-07-16 14:31               ` Christophe de Dinechin
  2020-07-16 14:34               ` Christophe de Dinechin
  0 siblings, 2 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-16 14:19 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng,
	Cornelia Huck, Jan Kiszka, Michael S. Tsirkin


On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
> 
> 
> > On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> >> 
> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >> 
> >>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >>>>>> Finally I'm curious if this is just a problem avoided by the s390
> >>>>>> channel approach? Does the use of messages over a channel just avoid the
> >>>>>> sort of bouncing back and forth that other hypervisors have to do when
> >>>>>> emulating a device?
> >>>>> 
> >>>>> What does "bouncing back and forth" mean exactly?
> >>>> 
> >>>> Context switching between guest and hypervisor.
> >>> 
> >>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> >>> request on s390 channel I/O.
> >> 
> >> Thanks.
> >> 
> >> I was also wondering about the efficiency of doorbells/notifications the
> >> other way. AFAIUI for both PCI and MMIO only a single write is required
> >> to the notify flag which causes a trap to the hypervisor and the rest of
> >> the processing. The hypervisor doesn't have the cost multiple exits to
> >> read the guest state although it obviously wants to be as efficient as
> >> possible passing the data back up to what ever is handling the backend
> >> of the device so it doesn't need to do multiple context switches.
> >> 
> >> Has there been any investigation into other mechanisms for notifying the
> >> hypervisor of an event - for example using a HYP call or similar
> >> mechanism?
> >> 
> >> My gut tells me this probably doesn't make any difference as a trap to
> >> the hypervisor is likely to cost the same either way because you still
> >> need to save the guest context before actioning something but it would
> >> be interesting to know if anyone has looked at it. Perhaps there is a
> >> benefit in partitioned systems where core running the guest can return
> >> straight away after initiating what it needs to internally in the
> >> hypervisor to pass the notification to something that can deal with it?
> > 
> > It's very architecture-specific. This is something Michael Tsirkin
> > looked in in the past. He found that MMIO and PIO perform differently on
> > x86. VIRTIO supports both so the device can be configured optimally.
> > There was an old discussion from 2013 here:
> > https://lkml.org/lkml/2013/4/4/299
> > 
> > Without nested page tables MMIO was slower than PIO. But with nested
> > page tables it was faster.
> > 
> > Another option on x86 is using Model-Specific Registers (for hypercalls)
> > but this doesn't fit into the PCI device model.
> 
> (Warning: What I write below is based on experience with very different
> architectures, both CPU and hypervisor; your mileage may vary)
> 
> It looks to me like the discussion so far is mostly focused on a "synchronous"
> model where presumably the same CPU is switching context between
> guest and (host) device emulation.
> 
> However, I/O devices on real hardware are asynchronous by construction.
> They do their thing while the CPU processes stuff. So at least theoretically,
> there is no reason to context switch on the same CPU. You could very well
> have an I/O thread on some other CPU doing its thing. This allows to
> do something some of you may have heard me talk about, called
> "interrupt coalescing".
> 
> As Stefan noted, this is not always a win, as it may introduce latency.
> There are at least two cases where this latency really hurts:
> 
> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
> was not active recently. Everything from cache to TLB may hit you here,
> but that normally happens when there isn't much I/O activity, so this case
> in practice does not hurt that much, or rather it hurts in a case where
> don't really care.
> 
> 2. When the I/O thread is preempted, or not given enough cycles to do its
> stuff. This happens when the system is both CPU and I/O bound, and
> addressing that is mostly a scheduling issue. A CPU thread could hand-off
> to a specific I/O thread, reducing that case to the kind of context switch
> Alex was mentioning, but I'm not sure how feasible it is to implement
> that on Linux / kvm.
> 
> In such cases, you have to pay for context switch. I'm not sure if that
> context switch is markedly more expensive than a "vmexit". On at least
> that alien architecture I was familiar with, there was little difference between
> switching to "your" host CPU thread and switching to "another" host
> I/O thread. But then the context switch was all in software, so we had
> designed it that way.
> 
> So let's assume now that you run your device emulation fully in an I/O
> thread, which we will assume for simplicity sits mostly in host user-space,
> and your guest I/O code runs in a CPU thread, which we will assume
> sits mostly in guest user/kernel space.
> 
> It is possible to share two-way doorbells / IRQ queues on some memory
> page, very similar to a virtqueue. When you want to "doorbell" your device,
> you simply write to that page. The device threads picks it up by reading
> the same page, and posts I/O completions on the same page, with simple
> memory writes.
> 
> Consider this I/O exchange buffer as having (at least) a writer and reader
> index for both doorbells and virtual interrupts. In the explanation
> below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read
> and write index. (Note that as a key optimization, you really
> don't want dwi and dri to be in the same cache line, since different
> CPUs are going to read and write them)
> 
> You obviously still need to "kick" the I/O or CPU thread, and we are
> talking about an IPI here since you don't know which CPU that other
> thread is sitting on. But the interesting property is that you only need
> to do that when dwi==dri or iwi==iri, because if not, the other side
> has already been "kicked" and will keep working, i.e. incrementing
> dri or iri, until it reaches back that state.
> 
> The real "interrupt coalescing" trick can happen here. In some
> cases, you can decide to update your dwi or iwi without kicking,
> as long as you know that you will need to kick later. That requires
> some heavy cooperation from guest drivers, though, and is a
> second-order optimization.
> 
> With a scheme like this, you replace a systematic context switch
> for each device interrupt with a memory write and a "fire and forget"
> kick IPI that only happens when the system is not already busy
> processing I/Os, so that it can be eliminated when the system is
> most busy. With interrupt coalescing, you can send IPIs at a rate
> much lower than the actual I/O rate.
> 
> Not sure how difficult it is to adapt a scheme like this to the current
> state of qemu / kvm, but I'm pretty sure it works well if you implement
> it correctly ;-)
> 
> > 
> > A bigger issue than vmexit latency is device emulation thread wakeup
> > latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
> > ioeventfd but it may be descheduled. Its physical CPU may be in a low
> > power state. I ran a benchmark late last year with QEMU's AioContext
> > adaptive polling disabled so we can measure the wakeup latency:
> > 
> >       CPU 0/KVM 26102 [000] 85626.737072:       kvm:kvm_fast_mmio:
> > fast mmio at gpa 0xfde03000
> >    IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
> >                   4 microseconds ------^

Hi Christophe,
QEMU/KVM does something similar to what you described. In the perf
output above the vmexit kvm_fast_mmio event occurs on physical CPU
"[000]".  The IOThread wakes up on physical CPU "[001]". Physical CPU#0
resumes guest execution immediately after marking the ioeventfd ready.
There is no context switch to the IOThread or a return from
ioctl(KVM_RUN) on CPU#0.

The IOThread reads the eventfd. An eventfd is a counter that is reset to
0 on read. Because it's a counter you get coalescing: if the guest
performs multiple MMIO writes the ioeventfd counter increases but the
IOThread only wakes up once and reads the ioeventfd.
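
A toy illustration of that counter behaviour:

    uint64_t one = 1, count;
    write(efd, &one, sizeof(one));      /* guest kick #1 */
    write(efd, &one, sizeof(one));      /* guest kick #2 */
    write(efd, &one, sizeof(one));      /* guest kick #3 */
    read(efd, &count, sizeof(count));   /* one wakeup: count == 3, reset to 0 */
    /* the IOThread then processes the ring once for all three kicks */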

VIRTIO itself also has a mechanism for suppressing notifications called
EVENT_IDX. It allows the driver to let the device know that it does not
require interrupts, and the device to let the driver know it does not
require virtqueue kicks. This reminds me a bit of the mitigation
mechanism you described.
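For reference, the suppression check itself is a single unsigned comparison;
the helper below mirrors vring_need_event() from the Linux virtio_ring uapi
header and is included here purely as an illustration:

#include <stdint.h>

/* Returns non-zero if the other side asked to be notified for this update,
 * i.e. if event_idx (the index the other side published) lies in the window
 * of indices advanced past since the last notification. */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}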

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 14:19             ` Stefan Hajnoczi
@ 2020-07-16 14:31               ` Christophe de Dinechin
  2020-07-16 14:34               ` Christophe de Dinechin
  1 sibling, 0 replies; 16+ messages in thread
From: Christophe de Dinechin @ 2020-07-16 14:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Alex Bennée, virtio-dev, Zha Bin, Jing Liu, Chao Peng,
	Cornelia Huck, Jan Kiszka, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 9301 bytes --]



> On 16 Jul 2020, at 16:19, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
>> 
>> 
>>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
>>>> 
>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>> 
>>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>>>>>>>> Finally I'm curious if this is just a problem avoided by the s390
>>>>>>>> channel approach? Does the use of messages over a channel just avoid the
>>>>>>>> sort of bouncing back and forth that other hypervisors have to do when
>>>>>>>> emulating a device?
>>>>>>> 
>>>>>>> What does "bouncing back and forth" mean exactly?
>>>>>> 
>>>>>> Context switching between guest and hypervisor.
>>>>> 
>>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
>>>>> request on s390 channel I/O.
>>>> 
>>>> Thanks.
>>>> 
>>>> I was also wondering about the efficiency of doorbells/notifications the
>>>> other way. AFAIUI for both PCI and MMIO only a single write is required
>>>> to the notify flag which causes a trap to the hypervisor and the rest of
>>>> the processing. The hypervisor doesn't have the cost of multiple exits to
>>>> read the guest state although it obviously wants to be as efficient as
>>>> possible passing the data back up to whatever is handling the backend
>>>> of the device so it doesn't need to do multiple context switches.
>>>> 
>>>> Has there been any investigation into other mechanisms for notifying the
>>>> hypervisor of an event - for example using a HYP call or similar
>>>> mechanism?
>>>> 
>>>> My gut tells me this probably doesn't make any difference as a trap to
>>>> the hypervisor is likely to cost the same either way because you still
>>>> need to save the guest context before actioning something but it would
>>>> be interesting to know if anyone has looked at it. Perhaps there is a
>>>> benefit in partitioned systems where the core running the guest can return
>>>> straight away after initiating what it needs to internally in the
>>>> hypervisor to pass the notification to something that can deal with it?
>>> 
>>> It's very architecture-specific. This is something Michael Tsirkin
>>> looked into in the past. He found that MMIO and PIO perform differently on
>>> x86. VIRTIO supports both so the device can be configured optimally.
>>> There was an old discussion from 2013 here:
>>> https://lkml.org/lkml/2013/4/4/299
>>> 
>>> Without nested page tables MMIO was slower than PIO. But with nested
>>> page tables it was faster.
>>> 
>>> Another option on x86 is using Model-Specific Registers (for hypercalls)
>>> but this doesn't fit into the PCI device model.
>> 
>> (Warning: What I write below is based on experience with very different
>> architectures, both CPU and hypervisor; your mileage may vary)
>> 
>> It looks to me like the discussion so far is mostly focused on a "synchronous"
>> model where presumably the same CPU is switching context between
>> guest and (host) device emulation.
>> 
>> However, I/O devices on real hardware are asynchronous by construction.
>> They do their thing while the CPU processes stuff. So at least theoretically,
>> there is no reason to context switch on the same CPU. You could very well
>> have an I/O thread on some other CPU doing its thing. This allows us to
>> do something some of you may have heard me talk about, called
>> "interrupt coalescing".
>> 
>> As Stefan noted, this is not always a win, as it may introduce latency.
>> There are at least two cases where this latency really hurts:
>> 
>> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
>> was not active recently. Everything from cache to TLB may hit you here,
>> but that normally happens when there isn't much I/O activity, so this case
>> in practice does not hurt that much, or rather it hurts in a case where
>> we don't really care.
>> 
>> 2. When the I/O thread is preempted, or not given enough cycles to do its
>> stuff. This happens when the system is both CPU and I/O bound, and
>> addressing that is mostly a scheduling issue. A CPU thread could hand-off
>> to a specific I/O thread, reducing that case to the kind of context switch
>> Alex was mentioning, but I'm not sure how feasible it is to implement
>> that on Linux / kvm.
>> 
>> In such cases, you have to pay for a context switch. I'm not sure if that
>> context switch is markedly more expensive than a "vmexit". On at least
>> that alien architecture I was familiar with, there was little difference between
>> switching to "your" host CPU thread and switching to "another" host
>> I/O thread. But then the context switch was all in software, so we had
>> designed it that way.
>> 
>> So let's assume now that you run your device emulation fully in an I/O
>> thread, which we will assume for simplicity sits mostly in host user-space,
>> and your guest I/O code runs in a CPU thread, which we will assume
>> sits mostly in guest user/kernel space.
>> 
>> It is possible to share two-way doorbells / IRQ queues on some memory
>> page, very similar to a virtqueue. When you want to "doorbell" your device,
>> you simply write to that page. The device thread picks it up by reading
>> the same page, and posts I/O completions on the same page, with simple
>> memory writes.
>> 
>> Consider this I/O exchange buffer as having (at least) a writer and reader
>> index for both doorbells and virtual interrupts. In the explanation
>> below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read
>> and write index. (Note that as a key optimization, you really
>> don't want dwi and dri to be in the same cache line, since different
>> CPUs are going to read and write them)
>> 
>> You obviously still need to "kick" the I/O or CPU thread, and we are
>> talking about an IPI here since you don't know which CPU that other
>> thread is sitting on. But the interesting property is that you only need
>> to do that when dwi==dri or iwi==iri, because if not, the other side
>> has already been "kicked" and will keep working, i.e. incrementing
>> dri or iri, until it reaches that state again.
>> 
>> The real "interrupt coalescing" trick can happen here. In some
>> cases, you can decide to update your dwi or iwi without kicking,
>> as long as you know that you will need to kick later. That requires
>> some heavy cooperation from guest drivers, though, and is a
>> second-order optimization.
>> 
>> With a scheme like this, you replace a systematic context switch
>> for each device interrupt with a memory write and a "fire and forget"
>> kick IPI that only happens when the system is not already busy
>> processing I/Os, so that it can be eliminated when the system is
>> most busy. With interrupt coalescing, you can send IPIs at a rate
>> much lower than the actual I/O rate.
>> 
>> Not sure how difficult it is to adapt a scheme like this to the current
>> state of qemu / kvm, but I'm pretty sure it works well if you implement
>> it correctly ;-)
>> 
>>> 
>>> A bigger issue than vmexit latency is device emulation thread wakeup
>>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
>>> ioeventfd but it may be descheduled. Its physical CPU may be in a low
>>> power state. I ran a benchmark late last year with QEMU's AioContext
>>> adaptive polling disabled so we can measure the wakeup latency:
>>> 
>>>      CPU 0/KVM 26102 [000] 85626.737072:       kvm:kvm_fast_mmio:
>>> fast mmio at gpa 0xfde03000
>>>   IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
>>>                  4 microseconds ------^
> 
> Hi Christophe,
> QEMU/KVM does something similar to what you described. In the perf
> output above the vmexit kvm_fast_mmio event occurs on physical CPU
> "[000]".  The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> resumes guest execution immediately after marking the ioeventfd ready.
> There is no context switch to the IOThread or a return from
> ioctl(KVM_RUN) on CPU#0.

Oh, that's good.

But then the conclusion that the 4us delay limits us to 250kIOPS
is incorrect, no? Is there anything that would prevent multiple
I/O events (doorbell or interrupt) from being in flight at the same time?

> 
> The IOThread reads the eventfd. An eventfd is a counter that is reset to
> 0 on read. Because it's a counter you get coalescing: if the guest
> performs multiple MMIO writes the ioeventfd counter increases but the
> IOThread only wakes up once and reads the ioeventfd.
> 
> VIRTIO itself also has a mechanism for suppressing notifications called
> EVENT_IDX. It allows the driver to let the device know that it does not
> require interrupts, and the device to let the driver know it does not
> require virtqueue kicks. This reminds me a bit of the mitigation
> mechanism you described.
> 
> Stefan


[-- Attachment #2: Type: text/html, Size: 25934 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 14:19             ` Stefan Hajnoczi
  2020-07-16 14:31               ` Christophe de Dinechin
@ 2020-07-16 14:34               ` Christophe de Dinechin
  2020-07-17  8:42                 ` Stefan Hajnoczi
  1 sibling, 1 reply; 16+ messages in thread
From: Christophe de Dinechin @ 2020-07-16 14:34 UTC (permalink / raw)
  Cc: virtio-dev

[-- Attachment #1: Type: text/plain, Size: 9299 bytes --]



> On 16 Jul 2020, at 16:19, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
>> 
>> 
>>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
>>>> 
>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>> 
>>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
>>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
>>>>>>>> Finally I'm curious if this is just a problem avoided by the s390
>>>>>>>> channel approach? Does the use of messages over a channel just avoid the
>>>>>>>> sort of bouncing back and forth that other hypervisors have to do when
>>>>>>>> emulating a device?
>>>>>>> 
>>>>>>> What does "bouncing back and forth" mean exactly?
>>>>>> 
>>>>>> Context switching between guest and hypervisor.
>>>>> 
>>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
>>>>> request on s390 channel I/O.
>>>> 
>>>> Thanks.
>>>> 
>>>> I was also wondering about the efficiency of doorbells/notifications the
>>>> other way. AFAIUI for both PCI and MMIO only a single write is required
>>>> to the notify flag which causes a trap to the hypervisor and the rest of
>>>> the processing. The hypervisor doesn't have the cost of multiple exits to
>>>> read the guest state although it obviously wants to be as efficient as
>>>> possible passing the data back up to whatever is handling the backend
>>>> of the device so it doesn't need to do multiple context switches.
>>>> 
>>>> Has there been any investigation into other mechanisms for notifying the
>>>> hypervisor of an event - for example using a HYP call or similar
>>>> mechanism?
>>>> 
>>>> My gut tells me this probably doesn't make any difference as a trap to
>>>> the hypervisor is likely to cost the same either way because you still
>>>> need to save the guest context before actioning something but it would
>>>> be interesting to know if anyone has looked at it. Perhaps there is a
>>>> benefit in partitioned systems where the core running the guest can return
>>>> straight away after initiating what it needs to internally in the
>>>> hypervisor to pass the notification to something that can deal with it?
>>> 
>>> It's very architecture-specific. This is something Michael Tsirkin
>>> looked into in the past. He found that MMIO and PIO perform differently on
>>> x86. VIRTIO supports both so the device can be configured optimally.
>>> There was an old discussion from 2013 here:
>>> https://lkml.org/lkml/2013/4/4/299
>>> 
>>> Without nested page tables MMIO was slower than PIO. But with nested
>>> page tables it was faster.
>>> 
>>> Another option on x86 is using Model-Specific Registers (for hypercalls)
>>> but this doesn't fit into the PCI device model.
>> 
>> (Warning: What I write below is based on experience with very different
>> architectures, both CPU and hypervisor; your mileage may vary)
>> 
>> It looks to me like the discussion so far is mostly focused on a "synchronous"
>> model where presumably the same CPU is switching context between
>> guest and (host) device emulation.
>> 
>> However, I/O devices on real hardware are asynchronous by construction.
>> They do their thing while the CPU processes stuff. So at least theoretically,
>> there is no reason to context switch on the same CPU. You could very well
>> have an I/O thread on some other CPU doing its thing. This allows us to
>> do something some of you may have heard me talk about, called
>> "interrupt coalescing".
>> 
>> As Stefan noted, this is not always a win, as it may introduce latency.
>> There are at least two cases where this latency really hurts:
>> 
>> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
>> was not active recently. Everything from cache to TLB may hit you here,
>> but that normally happens when there isn't much I/O activity, so this case
>> in practice does not hurt that much, or rather it hurts in a case where
>> we don't really care.
>> 
>> 2. When the I/O thread is preempted, or not given enough cycles to do its
>> stuff. This happens when the system is both CPU and I/O bound, and
>> addressing that is mostly a scheduling issue. A CPU thread could hand-off
>> to a specific I/O thread, reducing that case to the kind of context switch
>> Alex was mentioning, but I'm not sure how feasible it is to implement
>> that on Linux / kvm.
>> 
>> In such cases, you have to pay for a context switch. I'm not sure if that
>> context switch is markedly more expensive than a "vmexit". On at least
>> that alien architecture I was familiar with, there was little difference between
>> switching to "your" host CPU thread and switching to "another" host
>> I/O thread. But then the context switch was all in software, so we had
>> designed it that way.
>> 
>> So let's assume now that you run your device emulation fully in an I/O
>> thread, which we will assume for simplicity sits mostly in host user-space,
>> and your guest I/O code runs in a CPU thread, which we will assume
>> sits mostly in guest user/kernel space.
>> 
>> It is possible to share two-way doorbells / IRQ queues on some memory
>> page, very similar to a virtqueue. When you want to "doorbell" your device,
>> you simply write to that page. The device thread picks it up by reading
>> the same page, and posts I/O completions on the same page, with simple
>> memory writes.
>> 
>> Consider this I/O exchange buffer as having (at least) a writer and reader
>> index for both doorbells and virtual interrupts. In the explanation
>> below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read
>> and write index. (Note that as a key optimization, you really
>> don't want dwi and dri to be in the same cache line, since different
>> CPUs are going to read and write them)
>> 
>> You obviously still need to "kick" the I/O or CPU thread, and we are
>> talking about an IPI here since you don't know which CPU that other
>> thread is sitting on. But the interesting property is that you only need
>> to do that when dwi==dri or iwi==iri, because if not, the other side
>> has already been "kicked" and will keep working, i.e. incrementing
>> dri or iri, until it reaches that state again.
>> 
>> The real "interrupt coalescing" trick can happen here. In some
>> cases, you can decide to update your dwi or iwi without kicking,
>> as long as you know that you will need to kick later. That requires
>> some heavy cooperation from guest drivers, though, and is a
>> second-order optimization.
>> 
>> With a scheme like this, you replace a systematic context switch
>> for each device interrupt with a memory write and a "fire and forget"
>> kick IPI that only happens when the system is not already busy
>> processing I/Os, so that it can be eliminated when the system is
>> most busy. With interrupt coalescing, you can send IPIs at a rate
>> much lower than the actual I/O rate.
>> 
>> Not sure how difficult it is to adapt a scheme like this to the current
>> state of qemu / kvm, but I'm pretty sure it works well if you implement
>> it correctly ;-)
>> 
>>> 
>>> A bigger issue than vmexit latency is device emulation thread wakeup
>>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
>>> ioeventfd but it may be descheduled. Its physical CPU may be in a low
>>> power state. I ran a benchmark late last year with QEMU's AioContext
>>> adaptive polling disabled so we can measure the wakeup latency:
>>> 
>>>      CPU 0/KVM 26102 [000] 85626.737072:       kvm:kvm_fast_mmio:
>>> fast mmio at gpa 0xfde03000
>>>   IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
>>>                  4 microseconds ------^
> 
> Hi Christophe,
> QEMU/KVM does something similar to what you described. In the perf
> output above the vmexit kvm_fast_mmio event occurs on physical CPU
> "[000]".  The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> resumes guest execution immediately after marking the ioeventfd ready.
> There is no context switch to the IOThread or a return from
> ioctl(KVM_RUN) on CPU#0.
> 
Oh, that's good.

But then the conclusion that the 4us delay limits us to 250kIOPS
is incorrect, no? Is there anything that would prevent multiple
I/O events (doorbell or interrupt) from being in flight at the same time?

> The IOThread reads the eventfd. An eventfd is a counter that is reset to
> 0 on read. Because it's a counter you get coalescing: if the guest
> performs multiple MMIO writes the ioeventfd counter increases but the
> IOThread only wakes up once and reads the ioeventfd.
> 
> VIRTIO itself also has a mechanism for suppressing notifications called
> EVENT_IDX. It allows the driver to let the device know that it does not
> require interrupts, and the device to let the driver know it does not
> require virtqueue kicks. This reminds me a bit of the mitigation
> mechanism you described.
> 
> Stefan


[-- Attachment #2: Type: text/html, Size: 25975 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [virtio-dev] On doorbells (queue notifications)
  2020-07-16 14:34               ` Christophe de Dinechin
@ 2020-07-17  8:42                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2020-07-17  8:42 UTC (permalink / raw)
  To: Christophe de Dinechin; +Cc: virtio-dev

[-- Attachment #1: Type: text/plain, Size: 9928 bytes --]

On Thu, Jul 16, 2020 at 04:34:04PM +0200, Christophe de Dinechin wrote:
> 
> 
> > On 16 Jul 2020, at 16:19, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Thu, Jul 16, 2020 at 01:25:37PM +0200, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 16 Jul 2020, at 12:00, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> 
> >>> On Wed, Jul 15, 2020 at 05:40:33PM +0100, Alex Bennée wrote:
> >>>> 
> >>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>>> 
> >>>>> On Wed, Jul 15, 2020 at 02:29:04PM +0100, Alex Bennée wrote:
> >>>>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>>>>>> On Tue, Jul 14, 2020 at 10:43:36PM +0100, Alex Bennée wrote:
> >>>>>>>> Finally I'm curious if this is just a problem avoided by the s390
> >>>>>>>> channel approach? Does the use of messages over a channel just avoid the
> >>>>>>>> sort of bouncing back and forth that other hypervisors have to do when
> >>>>>>>> emulating a device?
> >>>>>>> 
> >>>>>>> What does "bouncing back and forth" mean exactly?
> >>>>>> 
> >>>>>> Context switching between guest and hypervisor.
> >>>>> 
> >>>>> I have CCed Cornelia Huck, who can explain the lifecycle of an I/O
> >>>>> request on s390 channel I/O.
> >>>> 
> >>>> Thanks.
> >>>> 
> >>>> I was also wondering about the efficiency of doorbells/notifications the
> >>>> other way. AFAIUI for both PCI and MMIO only a single write is required
> >>>> to the notify flag which causes a trap to the hypervisor and the rest of
> >>>> the processing. The hypervisor doesn't have the cost of multiple exits to
> >>>> read the guest state although it obviously wants to be as efficient as
> >>>> possible passing the data back up to whatever is handling the backend
> >>>> of the device so it doesn't need to do multiple context switches.
> >>>> 
> >>>> Has there been any investigation into other mechanisms for notifying the
> >>>> hypervisor of an event - for example using a HYP call or similar
> >>>> mechanism?
> >>>> 
> >>>> My gut tells me this probably doesn't make any difference as a trap to
> >>>> the hypervisor is likely to cost the same either way because you still
> >>>> need to save the guest context before actioning something but it would
> >>>> be interesting to know if anyone has looked at it. Perhaps there is a
> >>>> benefit in partitioned systems where the core running the guest can return
> >>>> straight away after initiating what it needs to internally in the
> >>>> hypervisor to pass the notification to something that can deal with it?
> >>> 
> >>> It's very architecture-specific. This is something Michael Tsirkin
> >>> looked into in the past. He found that MMIO and PIO perform differently on
> >>> x86. VIRTIO supports both so the device can be configured optimally.
> >>> There was an old discussion from 2013 here:
> >>> https://lkml.org/lkml/2013/4/4/299
> >>> 
> >>> Without nested page tables MMIO was slower than PIO. But with nested
> >>> page tables it was faster.
> >>> 
> >>> Another option on x86 is using Model-Specific Registers (for hypercalls)
> >>> but this doesn't fit into the PCI device model.
> >> 
> >> (Warning: What I write below is based on experience with very different
> >> architectures, both CPU and hypervisor; your mileage may vary)
> >> 
> >> It looks to me like the discussion so far is mostly focused on a "synchronous"
> >> model where presumably the same CPU is switching context between
> >> guest and (host) device emulation.
> >> 
> >> However, I/O devices on real hardware are asynchronous by construction.
> >> They do their thing while the CPU processes stuff. So at least theoretically,
> >> there is no reason to context switch on the same CPU. You could very well
> >> have an I/O thread on some other CPU doing its thing. This allows us to
> >> do something some of you may have heard me talk about, called
> >> "interrupt coalescing".
> >> 
> >> As Stefan noted, this is not always a win, as it may introduce latency.
> >> There are at least two cases where this latency really hurts:
> >> 
> >> 1. When the I/O thread is in some kind of deep sleep, e.g. because it
> >> was not active recently. Everything from cache to TLB may hit you here,
> >> but that normally happens when there isn't much I/O activity, so this case
> >> in practice does not hurt that much, or rather it hurts in a case where
> >> we don't really care.
> >> 
> >> 2. When the I/O thread is preempted, or not given enough cycles to do its
> >> stuff. This happens when the system is both CPU and I/O bound, and
> >> addressing that is mostly a scheduling issue. A CPU thread could hand-off
> >> to a specific I/O thread, reducing that case to the kind of context switch
> >> Alex was mentioning, but I'm not sure how feasible it is to implement
> >> that on Linux / kvm.
> >> 
> >> In such cases, you have to pay for a context switch. I'm not sure if that
> >> context switch is markedly more expensive than a "vmexit". On at least
> >> that alien architecture I was familiar with, there was little difference between
> >> switching to "your" host CPU thread and switching to "another" host
> >> I/O thread. But then the context switch was all in software, so we had
> >> designed it that way.
> >> 
> >> So let's assume now that you run your device emulation fully in an I/O
> >> thread, which we will assume for simplicity sits mostly in host user-space,
> >> and your guest I/O code runs in a CPU thread, which we will assume
> >> sits mostly in guest user/kernel space.
> >> 
> >> It is possible to share two-way doorbells / IRQ queues on some memory
> >> page, very similar to a virtqueue. When you want to "doorbell" your device,
> >> you simply write to that page. The device thread picks it up by reading
> >> the same page, and posts I/O completions on the same page, with simple
> >> memory writes.
> >> 
> >> Consider this I/O exchange buffer as having (at least) a writer and reader
> >> index for both doorbells and virtual interrupts. In the explanation
> >> below, I will call them "dwi", "dri", "iwi", "iri" for doorbell / interrupt read
> >> and write index. (Note that as a key optimization, you really
> >> don't want dwi and dri to be in the same cache line, since different
> >> CPUs are going to read and write them)
> >> 
> >> You obviously still need to "kick" the I/O or CPU thread, and we are
> >> talking about an IPI here since you don't know which CPU that other
> >> thread is sitting on. But the interesting property is that you only need
> >> to do that when dwi==dri or iwi==iri, because if not, the other side
> >> has already been "kicked" and will keep working, i.e. incrementing
> >> dri or iri, until it reaches that state again.
> >> 
> >> The real "interrupt coalescing" trick can happen here. In some
> >> cases, you can decide to update your dwi or iwi without kicking,
> >> as long as you know that you will need to kick later. That requires
> >> some heavy cooperation from guest drivers, though, and is a
> >> second-order optimization.
> >> 
> >> With a scheme like this, you replace a systematic context switch
> >> for each device interrupt with a memory write and a "fire and forget"
> >> kick IPI that only happens when the system is not already busy
> >> processing I/Os, so that it can be eliminated when the system is
> >> most busy. With interrupt coalescing, you can send IPIs at a rate
> >> much lower than the actual I/O rate.
> >> 
> >> Not sure how difficult it is to adapt a scheme like this to the current
> >> state of qemu / kvm, but I'm pretty sure it works well if you implement
> >> it correctly ;-)
> >> 
> >>> 
> >>> A bigger issue than vmexit latency is device emulation thread wakeup
> >>> latency. There is a thread (QEMU, vhost-user, vhost, etc) monitoring the
> >>> ioeventfd but it may be descheduled. Its physical CPU may be in a low
> >>> power state. I ran a benchmark late last year with QEMU's AioContext
> >>> adaptive polling disabled so we can measure the wakeup latency:
> >>> 
> >>>      CPU 0/KVM 26102 [000] 85626.737072:       kvm:kvm_fast_mmio:
> >>> fast mmio at gpa 0xfde03000
> >>>   IO iothread1 26099 [001] 85626.737076: syscalls:sys_exit_ppoll: 0x1
> >>>                  4 microseconds ------^
> > 
> > Hi Christophe,
> > QEMU/KVM does something similar to what you described. In the perf
> > output above the vmexit kvm_fast_mmio event occurs on physical CPU
> > "[000]".  The IOThread wakes up on physical CPU "[001]". Physical CPU#0
> > resumes guest execution immediately after marking the ioeventfd ready.
> > There is no context switch to the IOThread or a return from
> > ioctl(KVM_RUN) on CPU#0.
> > 
> Oh, that's good.
> 
> But then the conclusion that the 4us delay limits us to 250kIOPS
> is incorrect, no? Is there anything that would prevent multiple
> I/O events (doorbell or interrupt) from being in flight at the same time?

The number I posted is a worst-case latency scenario:

1. Queue depth = 1. No batching. Workloads that want to achieve maximum
   IOPS typically queue more than 1 request at a time. When multiple
   requests are queued both submission and completion can be batched so
   that only a fraction of vmexits + interrupts are required to process
   N requests.

2. QEMU AioContext adaptive polling disabled. Polling skips ioeventfd
   and instead polls on the virtqueue ring index memory that the guest
   updates. Polling is on by default when -device
   virtio-blk-pci,iothread= is used but I disabled it for this test.

But if we stick with the worst-case scenario then we are really limited
to 250k IOPS per virtio-blk device because a single IOThread is
processing the virtqueue with a 4 microsecond wakeup latency.
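Spelling that ceiling out (a back-of-the-envelope sketch; the batching figures
ignore per-request processing time):

#include <stdio.h>

int main(void)
{
    const double wakeup_latency_us = 4.0;                     /* measured above */
    const double wakeups_per_sec   = 1e6 / wakeup_latency_us; /* 250,000        */

    /* queue depth 1: one IOThread wakeup per request */
    printf("QD=1 ceiling: %.0f IOPS\n", wakeups_per_sec);

    /* with batching, N requests share one wakeup, so the ceiling scales */
    for (int n = 2; n <= 8; n *= 2)
        printf("batch of %d: %.0f IOPS\n", n, n * wakeups_per_sec);

    return 0;
}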

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-07-17  8:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14 21:43 [virtio-dev] On doorbells (queue notifications) Alex Bennée
2020-07-15 11:48 ` Stefan Hajnoczi
2020-07-15 13:29   ` Alex Bennée
2020-07-15 15:47     ` Stefan Hajnoczi
2020-07-15 16:40       ` Alex Bennée
2020-07-15 17:09         ` Cornelia Huck
2020-07-16 10:00         ` Stefan Hajnoczi
2020-07-16 11:25           ` Christophe de Dinechin
2020-07-16 14:19             ` Stefan Hajnoczi
2020-07-16 14:31               ` Christophe de Dinechin
2020-07-16 14:34               ` Christophe de Dinechin
2020-07-17  8:42                 ` Stefan Hajnoczi
2020-07-15 17:01       ` Cornelia Huck
2020-07-15 17:25         ` Alex Bennée
2020-07-15 20:04           ` Halil Pasic
2020-07-16  9:41             ` Cornelia Huck
