All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] Linux kernel polling for QEMU
@ 2016-11-24 15:12 Stefan Hajnoczi
  2016-11-28  9:31 ` Eliezer Tamir
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-24 15:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Jens Axboe, Christoph Hellwig, Eliezer Tamir, Davide Libenzi,
	Michael S. Tsirkin, Paolo Bonzini, Christian Borntraeger,
	Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 3448 bytes --]

I looked through the socket SO_BUSY_POLL and blk_mq poll support in
recent Linux kernels with an eye towards integrating the ongoing QEMU
polling work.  The main missing feature is eventfd polling support which
I describe below.

Background
----------
We're experimenting with polling in QEMU so I wondered if there are
advantages to having the kernel do polling instead of userspace.

One such advantage has been pointed out by Christian Borntraeger and
Paolo Bonzini: a userspace thread spins blindly without knowing when it
is hogging a CPU that other tasks need.  The kernel knows when other
tasks need to run and can skip polling in that case.

Power management might also benefit if the kernel was aware of polling
activity on the system.  That way polling can be controlled by the
system administrator in a single place.  Perhaps smarter power saving
choices can also be made by the kernel.

Another advantage is that the kernel can poll hardware rings (e.g. NIC
rx rings) whereas QEMU can only poll its own virtual memory (including
guest RAM).  That means the kernel can bypass interrupts for devices
that are using kernel drivers.

State of polling in Linux
-------------------------
SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
calls to spin awaiting new receive packets.  From what I can tell epoll
is not supported so that system call will sleep without polling.

blk_mq poll is mainly supported by NVMe.  It is only available with
synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
therefore not integrated.  It would be nice to extend the code so a
process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
or epoll will poll.

QEMU and KVM-specific polling
-----------------------------
There are a few QEMU/KVM-specific items that require polling support:

QEMU's event loop aio_notify() mechanism wakes up the event loop from a
blocking poll(2) or epoll call.  It is used when another thread adds or
changes an event loop resource (such as scheduling a BH).  There is a
userspace memory location (ctx->notified) that is written by
aio_notify() as well as an eventfd that can be signalled.

kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
devices use ioeventfd as a doorbell after new requests have been placed
in a virtqueue, which is a descriptor ring in userspace memory.

Eventfd polling support could look like this:

  struct eventfd_poll_info poll_info = {
      .addr = ...memory location...,
      .size = sizeof(uint32_t),
      .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
      .val  = ...last value...,
  };
  ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);

In the kernel, eventfd stashes this information and eventfd_poll()
evaluates the operation (e.g. not equal, bitwise and, etc) to detect
progress.

Note that this eventfd polling mechanism doesn't actually poll the
eventfd counter value.  It's useful for situations where the eventfd is
a doorbell/notification that some object in userspace memory has been
updated.  So it polls that userspace memory location directly.

This new eventfd feature also provides a poor man's Linux AIO polling
support: set the Linux AIO shared ring index as the eventfd polling
memory location.  This is not as good as true Linux AIO polling support
where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
rely on an interrupt to complete I/O requests.

Thoughts?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-24 15:12 [Qemu-devel] Linux kernel polling for QEMU Stefan Hajnoczi
@ 2016-11-28  9:31 ` Eliezer Tamir
  2016-11-28 15:29   ` Stefan Hajnoczi
  2016-11-28 20:41   ` Willem de Bruijn
  2016-11-29  8:19 ` Christian Borntraeger
  2016-11-29 10:32 ` Fam Zheng
  2 siblings, 2 replies; 37+ messages in thread
From: Eliezer Tamir @ 2016-11-28  9:31 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Jens Axboe, Christoph Hellwig, Davide Libenzi,
	Michael S. Tsirkin, Paolo Bonzini, Christian Borntraeger,
	Fam Zheng, Eric Dumazet, Willem de Bruijn

+ Eric, Willem

On 24/11/2016 17:12, Stefan Hajnoczi wrote:
> I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> recent Linux kernels with an eye towards integrating the ongoing QEMU
> polling work.  The main missing feature is eventfd polling support which
> I describe below.
...
> State of polling in Linux
> -------------------------
> SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> calls to spin awaiting new receive packets.  From what I can tell epoll
> is not supported so that system call will sleep without polling.

At the time I sent out an RFC for epoll() SO_BUSY_POLL support.
https://lkml.org/lkml/2013/8/21/192

In hindsight I think the way I tracked sockets was over-complicated.
What I would do today would be to extend the API to allow the user
to tell epoll which socket/queue combinations are interesting.

I would love to collaborate on this with you, though I must confess that
my resources at the moment are limited and the setup I used for testing
no longer exists.

-Eliezer

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-28  9:31 ` Eliezer Tamir
@ 2016-11-28 15:29   ` Stefan Hajnoczi
  2016-11-28 15:41     ` Paolo Bonzini
  2016-11-28 20:41   ` Willem de Bruijn
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-28 15:29 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: qemu-devel, Jens Axboe, Christoph Hellwig, Davide Libenzi,
	Michael S. Tsirkin, Paolo Bonzini, Christian Borntraeger,
	Fam Zheng, Eric Dumazet, Willem de Bruijn

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

On Mon, Nov 28, 2016 at 11:31:43AM +0200, Eliezer Tamir wrote:
> + Eric, Willem
> 
> On 24/11/2016 17:12, Stefan Hajnoczi wrote:
> > I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> > recent Linux kernels with an eye towards integrating the ongoing QEMU
> > polling work.  The main missing feature is eventfd polling support which
> > I describe below.
> ...
> > State of polling in Linux
> > -------------------------
> > SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> > calls to spin awaiting new receive packets.  From what I can tell epoll
> > is not supported so that system call will sleep without polling.
> 
> At the time I sent out an RFC for epoll() SO_BUSY_POLL support.
> https://lkml.org/lkml/2013/8/21/192
> 
> In hindsight I think the way I tracked sockets was over-complicated.
> What I would do today would be to extend the API to allow the user
> to tell epoll which socket/queue combinations are interesting.
> 
> I would love to collaborate on this with you, though I must confess that
> my resources at the moment are limited and the setup I used for testing
> no longer exists.

Thanks for sharing the link.  I'll let you know before embarking on an
effort to make epoll support busy_loop.

At the moment I'm still evaluating whether the good results we've gotten
from polling in QEMU userspace are preserved when polling is shifted to
the kernel.

FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
chance to test it yet:
https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-28 15:29   ` Stefan Hajnoczi
@ 2016-11-28 15:41     ` Paolo Bonzini
  2016-11-29 10:45       ` Stefan Hajnoczi
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2016-11-28 15:41 UTC (permalink / raw)
  To: Stefan Hajnoczi, Eliezer Tamir
  Cc: Willem de Bruijn, Fam Zheng, Eric Dumazet, Michael S. Tsirkin,
	qemu-devel, Jens Axboe, Christian Borntraeger, Davide Libenzi,
	Christoph Hellwig



On 28/11/2016 16:29, Stefan Hajnoczi wrote:
> Thanks for sharing the link.  I'll let you know before embarking on an
> effort to make epoll support busy_loop.
> 
> At the moment I'm still evaluating whether the good results we've gotten
> from polling in QEMU userspace are preserved when polling is shifted to
> the kernel.
> 
> FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
> chance to test it yet:
> https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13

This would add a system call every time the main loop processes a vring,
wouldn't it?

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-28  9:31 ` Eliezer Tamir
  2016-11-28 15:29   ` Stefan Hajnoczi
@ 2016-11-28 20:41   ` Willem de Bruijn
  1 sibling, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2016-11-28 20:41 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Stefan Hajnoczi, qemu-devel, Jens Axboe, Christoph Hellwig,
	Davide Libenzi, Michael S. Tsirkin, Paolo Bonzini,
	Christian Borntraeger, Fam Zheng, Eric Dumazet

On Mon, Nov 28, 2016 at 4:31 AM, Eliezer Tamir
<eliezer.tamir@linux.intel.com> wrote:
> + Eric, Willem
>
> On 24/11/2016 17:12, Stefan Hajnoczi wrote:
>> I looked through the socket SO_BUSY_POLL and blk_mq poll support in
>> recent Linux kernels with an eye towards integrating the ongoing QEMU
>> polling work.  The main missing feature is eventfd polling support which
>> I describe below.
> ...
>> State of polling in Linux
>> -------------------------
>> SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
>> calls to spin awaiting new receive packets.  From what I can tell epoll
>> is not supported so that system call will sleep without polling.
>
> At the time I sent out an RFC for epoll() SO_BUSY_POLL support.
> https://lkml.org/lkml/2013/8/21/192
>
> In hindsight I think the way I tracked sockets was over-complicated.
> What I would do today would be to extend the API to allow the user
> to tell epoll which socket/queue combinations are interesting.

Also note the trivial special case for setups with one single-queue
nic (most 1Gb machines).

On multi-queue setups a less trivial attempt is optimistically spinning on
queues whose interrupt is pinned to same cpu as the process. Based
on the heuristic that a process that cares enough about low-latency to
configure busy poll will also care enough to set cpu affinity.

Both heuristics can be implemented without an explicit user API.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-24 15:12 [Qemu-devel] Linux kernel polling for QEMU Stefan Hajnoczi
  2016-11-28  9:31 ` Eliezer Tamir
@ 2016-11-29  8:19 ` Christian Borntraeger
  2016-11-29 11:00   ` Stefan Hajnoczi
  2016-11-29 10:32 ` Fam Zheng
  2 siblings, 1 reply; 37+ messages in thread
From: Christian Borntraeger @ 2016-11-29  8:19 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Jens Axboe, Christoph Hellwig, Eliezer Tamir, Davide Libenzi,
	Michael S. Tsirkin, Paolo Bonzini, Fam Zheng

On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
> I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> recent Linux kernels with an eye towards integrating the ongoing QEMU
> polling work.  The main missing feature is eventfd polling support which
> I describe below.
> 
> Background
> ----------
> We're experimenting with polling in QEMU so I wondered if there are
> advantages to having the kernel do polling instead of userspace.
> 
> One such advantage has been pointed out by Christian Borntraeger and
> Paolo Bonzini: a userspace thread spins blindly without knowing when it
> is hogging a CPU that other tasks need.  The kernel knows when other
> tasks need to run and can skip polling in that case.
> 
> Power management might also benefit if the kernel was aware of polling
> activity on the system.  That way polling can be controlled by the
> system administrator in a single place.  Perhaps smarter power saving
> choices can also be made by the kernel.
> 
> Another advantage is that the kernel can poll hardware rings (e.g. NIC
> rx rings) whereas QEMU can only poll its own virtual memory (including
> guest RAM).  That means the kernel can bypass interrupts for devices
> that are using kernel drivers.
> 
> State of polling in Linux
> -------------------------
> SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> calls to spin awaiting new receive packets.  From what I can tell epoll
> is not supported so that system call will sleep without polling.
> 
> blk_mq poll is mainly supported by NVMe.  It is only available with
> synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
> therefore not integrated.  It would be nice to extend the code so a
> process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
> or epoll will poll.
> 
> QEMU and KVM-specific polling
> -----------------------------
> There are a few QEMU/KVM-specific items that require polling support:
> 
> QEMU's event loop aio_notify() mechanism wakes up the event loop from a
> blocking poll(2) or epoll call.  It is used when another thread adds or
> changes an event loop resource (such as scheduling a BH).  There is a
> userspace memory location (ctx->notified) that is written by
> aio_notify() as well as an eventfd that can be signalled.
> 
> kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
> devices use ioeventfd as a doorbell after new requests have been placed
> in a virtqueue, which is a descriptor ring in userspace memory.
> 
> Eventfd polling support could look like this:
> 
>   struct eventfd_poll_info poll_info = {
>       .addr = ...memory location...,
>       .size = sizeof(uint32_t),
>       .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
>       .val  = ...last value...,
>   };
>   ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
> 
> In the kernel, eventfd stashes this information and eventfd_poll()
> evaluates the operation (e.g. not equal, bitwise and, etc) to detect
> progress.
> 
> Note that this eventfd polling mechanism doesn't actually poll the
> eventfd counter value.  It's useful for situations where the eventfd is
> a doorbell/notification that some object in userspace memory has been
> updated.  So it polls that userspace memory location directly.
> 
> This new eventfd feature also provides a poor man's Linux AIO polling
> support: set the Linux AIO shared ring index as the eventfd polling
> memory location.  This is not as good as true Linux AIO polling support
> where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
> rely on an interrupt to complete I/O requests.
> 
> Thoughts?

Would be an interesting excercise, but we should really try to avoid making
the iothreads more costly. When I look at some of our measurements, I/O-wise
we are  slightly behind z/VM, which can be tuned to be in a similar area but
we use more host CPUs on s390 for the same throughput.

So I have two concerns and both a related to overhead.
a: I am able to get a higher bandwidth and lower host cpu utilization
when running fio for multiple disks when I pin the iothreads to a subset of
the host CPUs (there is a sweet spot). Is the polling maybe just influencing
the scheduler to do the same by making the iothread not doing sleep/wakeup
all the time?
b: what about contention with other guests on the host? What
worries me a bit, is the fact that most performance measurements and
tunings are done for workloads without that. We (including myself) do our
microbenchmarks (or fio runs) with just one guest and are happy if we see
an improvement. But does that reflect real usage? For example have you ever
measured the aio polling with 10 guests or so?
My gut feeling (and obviously I have not done proper measurements myself) is
that we want to stop polling as soon as there is contention.

As you outlined, we already have something in place in the kernel to stop
polling

Interestingly enough, for SO_BUSY_POLL the network code seems to consider
    !need_resched() && !signal_pending(current)
for stopping the poll, which allows to consume your time slice. KVM instead
uses single_task_running() for the halt_poll_thing. This means that KVM 
yields much more aggressively, which is probably the right thing to do for
opportunistic spinning.

Another thing to consider: In the kernel we have already other opportunistic
spinners and we are in the process of making things less aggressive because
it caused real issues. For example search for the  vcpu_is_preempted​ patch set.
Which by the way shown another issue, running nested you do not only want to
consider your own load, but also the load of the hypervisor.

Christian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-24 15:12 [Qemu-devel] Linux kernel polling for QEMU Stefan Hajnoczi
  2016-11-28  9:31 ` Eliezer Tamir
  2016-11-29  8:19 ` Christian Borntraeger
@ 2016-11-29 10:32 ` Fam Zheng
  2016-11-29 11:17   ` Paolo Bonzini
  2 siblings, 1 reply; 37+ messages in thread
From: Fam Zheng @ 2016-11-29 10:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Jens Axboe, Christoph Hellwig, Eliezer Tamir,
	Davide Libenzi, Michael S. Tsirkin, Paolo Bonzini,
	Christian Borntraeger

On Thu, 11/24 15:12, Stefan Hajnoczi wrote:
> QEMU and KVM-specific polling
> -----------------------------
> There are a few QEMU/KVM-specific items that require polling support:
> 
> QEMU's event loop aio_notify() mechanism wakes up the event loop from a
> blocking poll(2) or epoll call.  It is used when another thread adds or
> changes an event loop resource (such as scheduling a BH).  There is a
> userspace memory location (ctx->notified) that is written by
> aio_notify() as well as an eventfd that can be signalled.

I'm thinking about an alternative approach to achieve user space "idle polling"
like kvm_halt_poll_ns.

The kernel change will be a new prctl operation (should it be a different
syscall to extend?) to register a new type of eventfd called "idle eventfd":

    prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
    prctl(PR_DEL_IDLE_EVENTFD, int eventfd);

It will be notified by kernel each time when the thread's local core has no
runnable threads (i.e., entering idle state).

QEMU can then add this eventfd to its event loop when it has events to poll, and
watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
if a ppoll() would have blocked because there are no new events, it could now
return immediately because of idle_eventfd events, and do the idle polling.

Does that make any sense?

Fam

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-28 15:41     ` Paolo Bonzini
@ 2016-11-29 10:45       ` Stefan Hajnoczi
  2016-11-30 17:41         ` Avi Kivity
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-29 10:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eliezer Tamir, Willem de Bruijn, Fam Zheng, Eric Dumazet,
	Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1200 bytes --]

On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote:
> On 28/11/2016 16:29, Stefan Hajnoczi wrote:
> > Thanks for sharing the link.  I'll let you know before embarking on an
> > effort to make epoll support busy_loop.
> > 
> > At the moment I'm still evaluating whether the good results we've gotten
> > from polling in QEMU userspace are preserved when polling is shifted to
> > the kernel.
> > 
> > FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
> > chance to test it yet:
> > https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13
> 
> This would add a system call every time the main loop processes a vring,
> wouldn't it?

Yes, this is a problem and is the reason I haven't finished implementing
a test using QEMU yet.

My proposed eventfd polling mechanism doesn't work well with descriptor
ring indices because the polling info needs to be updated each event
loop iteration with the last seen ring index.

This can be solved by making struct eventfd_poll_info.value take a
userspace memory address.  The value to compare against is fetched each
polling iteration, avoiding the need for ioctl calls.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29  8:19 ` Christian Borntraeger
@ 2016-11-29 11:00   ` Stefan Hajnoczi
  2016-11-29 11:58     ` Christian Borntraeger
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-29 11:00 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: qemu-devel, Jens Axboe, Christoph Hellwig, Eliezer Tamir,
	Davide Libenzi, Michael S. Tsirkin, Paolo Bonzini, Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 6887 bytes --]

On Tue, Nov 29, 2016 at 09:19:22AM +0100, Christian Borntraeger wrote:
> On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
> > I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> > recent Linux kernels with an eye towards integrating the ongoing QEMU
> > polling work.  The main missing feature is eventfd polling support which
> > I describe below.
> > 
> > Background
> > ----------
> > We're experimenting with polling in QEMU so I wondered if there are
> > advantages to having the kernel do polling instead of userspace.
> > 
> > One such advantage has been pointed out by Christian Borntraeger and
> > Paolo Bonzini: a userspace thread spins blindly without knowing when it
> > is hogging a CPU that other tasks need.  The kernel knows when other
> > tasks need to run and can skip polling in that case.
> > 
> > Power management might also benefit if the kernel was aware of polling
> > activity on the system.  That way polling can be controlled by the
> > system administrator in a single place.  Perhaps smarter power saving
> > choices can also be made by the kernel.
> > 
> > Another advantage is that the kernel can poll hardware rings (e.g. NIC
> > rx rings) whereas QEMU can only poll its own virtual memory (including
> > guest RAM).  That means the kernel can bypass interrupts for devices
> > that are using kernel drivers.
> > 
> > State of polling in Linux
> > -------------------------
> > SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> > calls to spin awaiting new receive packets.  From what I can tell epoll
> > is not supported so that system call will sleep without polling.
> > 
> > blk_mq poll is mainly supported by NVMe.  It is only available with
> > synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
> > therefore not integrated.  It would be nice to extend the code so a
> > process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
> > or epoll will poll.
> > 
> > QEMU and KVM-specific polling
> > -----------------------------
> > There are a few QEMU/KVM-specific items that require polling support:
> > 
> > QEMU's event loop aio_notify() mechanism wakes up the event loop from a
> > blocking poll(2) or epoll call.  It is used when another thread adds or
> > changes an event loop resource (such as scheduling a BH).  There is a
> > userspace memory location (ctx->notified) that is written by
> > aio_notify() as well as an eventfd that can be signalled.
> > 
> > kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
> > devices use ioeventfd as a doorbell after new requests have been placed
> > in a virtqueue, which is a descriptor ring in userspace memory.
> > 
> > Eventfd polling support could look like this:
> > 
> >   struct eventfd_poll_info poll_info = {
> >       .addr = ...memory location...,
> >       .size = sizeof(uint32_t),
> >       .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
> >       .val  = ...last value...,
> >   };
> >   ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
> > 
> > In the kernel, eventfd stashes this information and eventfd_poll()
> > evaluates the operation (e.g. not equal, bitwise and, etc) to detect
> > progress.
> > 
> > Note that this eventfd polling mechanism doesn't actually poll the
> > eventfd counter value.  It's useful for situations where the eventfd is
> > a doorbell/notification that some object in userspace memory has been
> > updated.  So it polls that userspace memory location directly.
> > 
> > This new eventfd feature also provides a poor man's Linux AIO polling
> > support: set the Linux AIO shared ring index as the eventfd polling
> > memory location.  This is not as good as true Linux AIO polling support
> > where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
> > rely on an interrupt to complete I/O requests.
> > 
> > Thoughts?
> 
> Would be an interesting excercise, but we should really try to avoid making
> the iothreads more costly. When I look at some of our measurements, I/O-wise
> we are  slightly behind z/VM, which can be tuned to be in a similar area but
> we use more host CPUs on s390 for the same throughput.
> 
> So I have two concerns and both a related to overhead.
> a: I am able to get a higher bandwidth and lower host cpu utilization
> when running fio for multiple disks when I pin the iothreads to a subset of
> the host CPUs (there is a sweet spot). Is the polling maybe just influencing
> the scheduler to do the same by making the iothread not doing sleep/wakeup
> all the time?

Interesting theory, look at sched_switch tracing data to find out
whether that is true.  Do you get any benefit from combining the sweet
spot pinning with polling?

> b: what about contention with other guests on the host? What
> worries me a bit, is the fact that most performance measurements and
> tunings are done for workloads without that. We (including myself) do our
> microbenchmarks (or fio runs) with just one guest and are happy if we see
> an improvement. But does that reflect real usage? For example have you ever
> measured the aio polling with 10 guests or so?
> My gut feeling (and obviously I have not done proper measurements myself) is
> that we want to stop polling as soon as there is contention.
> 
> As you outlined, we already have something in place in the kernel to stop
> polling
> 
> Interestingly enough, for SO_BUSY_POLL the network code seems to consider
>     !need_resched() && !signal_pending(current)
> for stopping the poll, which allows to consume your time slice. KVM instead
> uses single_task_running() for the halt_poll_thing. This means that KVM 
> yields much more aggressively, which is probably the right thing to do for
> opportunistic spinning.

Another thing I noticed about the busy_poll implementation is that it
will spin if *any* file descriptor supports polling.

In QEMU we decided to implement the opposite: spin only if *all* event
sources support polling.  The reason is that we don't want polling to
introduce any extra latency on the event sources that do not support
polling.

> Another thing to consider: In the kernel we have already other opportunistic
> spinners and we are in the process of making things less aggressive because
> it caused real issues. For example search for the  vcpu_is_preempted​ patch set.
> Which by the way shown another issue, running nested you do not only want to
> consider your own load, but also the load of the hypervisor.

These are good points and it's why I think polling in the kernel can
make smarter decisions than in polling userspace.  There are multiple
components in the system that can do polling, it would be best to have a
single place so that the polling activity doesn't interfere.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 10:32 ` Fam Zheng
@ 2016-11-29 11:17   ` Paolo Bonzini
  2016-11-29 13:24     ` Fam Zheng
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2016-11-29 11:17 UTC (permalink / raw)
  To: Fam Zheng, Stefan Hajnoczi
  Cc: Eliezer Tamir, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig



On 29/11/2016 11:32, Fam Zheng wrote:
> 
> The kernel change will be a new prctl operation (should it be a different
> syscall to extend?) to register a new type of eventfd called "idle eventfd":
> 
>     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
>     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
> 
> It will be notified by kernel each time when the thread's local core has no
> runnable threads (i.e., entering idle state).
> 
> QEMU can then add this eventfd to its event loop when it has events to poll, and
> watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
> if a ppoll() would have blocked because there are no new events, it could now
> return immediately because of idle_eventfd events, and do the idle polling.

This has two issues:

* it only reports the leading edge of single_task_running().  Is it also
useful to stop polling on the trailing edge?

* it still needs a system call before polling is entered.  Ideally, QEMU
could run without any system call while in polling mode.

Another possibility is to add a system call for single_task_running().
It should be simple enough that you can implement it in the vDSO and
avoid a context switch.  There are convenient hooking points in
add_nr_running and sub_nr_running.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 11:00   ` Stefan Hajnoczi
@ 2016-11-29 11:58     ` Christian Borntraeger
  0 siblings, 0 replies; 37+ messages in thread
From: Christian Borntraeger @ 2016-11-29 11:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Jens Axboe, Christoph Hellwig, Eliezer Tamir,
	Davide Libenzi, Michael S. Tsirkin, Paolo Bonzini, Fam Zheng

On 11/29/2016 12:00 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 29, 2016 at 09:19:22AM +0100, Christian Borntraeger wrote:
>> On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
>>> I looked through the socket SO_BUSY_POLL and blk_mq poll support in
>>> recent Linux kernels with an eye towards integrating the ongoing QEMU
>>> polling work.  The main missing feature is eventfd polling support which
>>> I describe below.
>>>
>>> Background
>>> ----------
>>> We're experimenting with polling in QEMU so I wondered if there are
>>> advantages to having the kernel do polling instead of userspace.
>>>
>>> One such advantage has been pointed out by Christian Borntraeger and
>>> Paolo Bonzini: a userspace thread spins blindly without knowing when it
>>> is hogging a CPU that other tasks need.  The kernel knows when other
>>> tasks need to run and can skip polling in that case.
>>>
>>> Power management might also benefit if the kernel was aware of polling
>>> activity on the system.  That way polling can be controlled by the
>>> system administrator in a single place.  Perhaps smarter power saving
>>> choices can also be made by the kernel.
>>>
>>> Another advantage is that the kernel can poll hardware rings (e.g. NIC
>>> rx rings) whereas QEMU can only poll its own virtual memory (including
>>> guest RAM).  That means the kernel can bypass interrupts for devices
>>> that are using kernel drivers.
>>>
>>> State of polling in Linux
>>> -------------------------
>>> SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
>>> calls to spin awaiting new receive packets.  From what I can tell epoll
>>> is not supported so that system call will sleep without polling.
>>>
>>> blk_mq poll is mainly supported by NVMe.  It is only available with
>>> synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
>>> therefore not integrated.  It would be nice to extend the code so a
>>> process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
>>> or epoll will poll.
>>>
>>> QEMU and KVM-specific polling
>>> -----------------------------
>>> There are a few QEMU/KVM-specific items that require polling support:
>>>
>>> QEMU's event loop aio_notify() mechanism wakes up the event loop from a
>>> blocking poll(2) or epoll call.  It is used when another thread adds or
>>> changes an event loop resource (such as scheduling a BH).  There is a
>>> userspace memory location (ctx->notified) that is written by
>>> aio_notify() as well as an eventfd that can be signalled.
>>>
>>> kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
>>> devices use ioeventfd as a doorbell after new requests have been placed
>>> in a virtqueue, which is a descriptor ring in userspace memory.
>>>
>>> Eventfd polling support could look like this:
>>>
>>>   struct eventfd_poll_info poll_info = {
>>>       .addr = ...memory location...,
>>>       .size = sizeof(uint32_t),
>>>       .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
>>>       .val  = ...last value...,
>>>   };
>>>   ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
>>>
>>> In the kernel, eventfd stashes this information and eventfd_poll()
>>> evaluates the operation (e.g. not equal, bitwise and, etc) to detect
>>> progress.
>>>
>>> Note that this eventfd polling mechanism doesn't actually poll the
>>> eventfd counter value.  It's useful for situations where the eventfd is
>>> a doorbell/notification that some object in userspace memory has been
>>> updated.  So it polls that userspace memory location directly.
>>>
>>> This new eventfd feature also provides a poor man's Linux AIO polling
>>> support: set the Linux AIO shared ring index as the eventfd polling
>>> memory location.  This is not as good as true Linux AIO polling support
>>> where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
>>> rely on an interrupt to complete I/O requests.
>>>
>>> Thoughts?
>>
>> Would be an interesting excercise, but we should really try to avoid making
>> the iothreads more costly. When I look at some of our measurements, I/O-wise
>> we are  slightly behind z/VM, which can be tuned to be in a similar area but
>> we use more host CPUs on s390 for the same throughput.
>>
>> So I have two concerns and both a related to overhead.
>> a: I am able to get a higher bandwidth and lower host cpu utilization
>> when running fio for multiple disks when I pin the iothreads to a subset of
>> the host CPUs (there is a sweet spot). Is the polling maybe just influencing
>> the scheduler to do the same by making the iothread not doing sleep/wakeup
>> all the time?
> 
> Interesting theory, look at sched_switch tracing data to find out
> whether that is true. 

Looking at vmstat, a poll value on 50000 seems to reduce the amount of
context switches. Depending on the workload almost no change to sometimes
a lot (one test from 250000/sec to 150000/sec)
According to sched_switch it does still move as before between the CPUs,
so my theory does not seem to hold.

On the other hand this is a development s390 system that I share  with 84
other LPARs, so I have trouble to get stable results as soon as I have a 
high data rate. I would need to find a time slot on one of the dedicated
systems, but maybe its just easier to reproduce this on x86.

 Do you get any benefit from combining the sweet
> spot pinning with polling?

maybe, but it seems that you have to give a little bit more CPUs to the
iothreads. What I can tell is that combining both hurts for the case with
more than one disk and all iothreads are pinned to just one host CPU as
soon as the polling is too big.

> 
>> b: what about contention with other guests on the host? What
>> worries me a bit, is the fact that most performance measurements and
>> tunings are done for workloads without that. We (including myself) do our
>> microbenchmarks (or fio runs) with just one guest and are happy if we see
>> an improvement. But does that reflect real usage? For example have you ever
>> measured the aio polling with 10 guests or so?
>> My gut feeling (and obviously I have not done proper measurements myself) is
>> that we want to stop polling as soon as there is contention.
>>
>> As you outlined, we already have something in place in the kernel to stop
>> polling
>>
>> Interestingly enough, for SO_BUSY_POLL the network code seems to consider
>>     !need_resched() && !signal_pending(current)
>> for stopping the poll, which allows to consume your time slice. KVM instead
>> uses single_task_running() for the halt_poll_thing. This means that KVM 
>> yields much more aggressively, which is probably the right thing to do for
>> opportunistic spinning.
> 
> Another thing I noticed about the busy_poll implementation is that it
> will spin if *any* file descriptor supports polling.
> 
> In QEMU we decided to implement the opposite: spin only if *all* event
> sources support polling.  The reason is that we don't want polling to
> introduce any extra latency on the event sources that do not support
> polling.
> 
>> Another thing to consider: In the kernel we have already other opportunistic
>> spinners and we are in the process of making things less aggressive because
>> it caused real issues. For example search for the  vcpu_is_preempted​ patch set.
>> Which by the way shown another issue, running nested you do not only want to
>> consider your own load, but also the load of the hypervisor.
> 
> These are good points and it's why I think polling in the kernel can
> make smarter decisions than in polling userspace.  There are multiple
> components in the system that can do polling, it would be best to have a
> single place so that the polling activity doesn't interfere.
> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 11:17   ` Paolo Bonzini
@ 2016-11-29 13:24     ` Fam Zheng
  2016-11-29 13:27       ` Paolo Bonzini
  2016-11-29 20:43       ` Stefan Hajnoczi
  0 siblings, 2 replies; 37+ messages in thread
From: Fam Zheng @ 2016-11-29 13:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Davide Libenzi,
	Christoph Hellwig

On Tue, 11/29 12:17, Paolo Bonzini wrote:
> 
> 
> On 29/11/2016 11:32, Fam Zheng wrote:
> > 
> > The kernel change will be a new prctl operation (should it be a different
> > syscall to extend?) to register a new type of eventfd called "idle eventfd":
> > 
> >     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
> >     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
> > 
> > It will be notified by kernel each time when the thread's local core has no
> > runnable threads (i.e., entering idle state).
> > 
> > QEMU can then add this eventfd to its event loop when it has events to poll, and
> > watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
> > if a ppoll() would have blocked because there are no new events, it could now
> > return immediately because of idle_eventfd events, and do the idle polling.
> 
> This has two issues:
> 
> * it only reports the leading edge of single_task_running().  Is it also
> useful to stop polling on the trailing edge?

QEMU can clear the eventfd right after event firing so I don't think it is
necessary.

> 
> * it still needs a system call before polling is entered.  Ideally, QEMU
> could run without any system call while in polling mode.
> 
> Another possibility is to add a system call for single_task_running().
> It should be simple enough that you can implement it in the vDSO and
> avoid a context switch.  There are convenient hooking points in
> add_nr_running and sub_nr_running.

That sounds good!

Fam

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 13:24     ` Fam Zheng
@ 2016-11-29 13:27       ` Paolo Bonzini
  2016-11-29 14:17         ` Fam Zheng
  2016-11-29 20:43       ` Stefan Hajnoczi
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2016-11-29 13:27 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Stefan Hajnoczi, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Davide Libenzi,
	Christoph Hellwig



On 29/11/2016 14:24, Fam Zheng wrote:
> On Tue, 11/29 12:17, Paolo Bonzini wrote:
>>
>>
>> On 29/11/2016 11:32, Fam Zheng wrote:
>>>
>>> The kernel change will be a new prctl operation (should it be a different
>>> syscall to extend?) to register a new type of eventfd called "idle eventfd":
>>>
>>>     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
>>>     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
>>>
>>> It will be notified by kernel each time when the thread's local core has no
>>> runnable threads (i.e., entering idle state).
>>>
>>> QEMU can then add this eventfd to its event loop when it has events to poll, and
>>> watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
>>> if a ppoll() would have blocked because there are no new events, it could now
>>> return immediately because of idle_eventfd events, and do the idle polling.
>>
>> This has two issues:
>>
>> * it only reports the leading edge of single_task_running().  Is it also
>> useful to stop polling on the trailing edge?
> 
> QEMU can clear the eventfd right after event firing so I don't think it is
> necessary.

Yes, but how would QEMU know that the eventfd has fired?  It would be
very expensive to read the eventfd on each iteration of polling.

Paolo

>> * it still needs a system call before polling is entered.  Ideally, QEMU
>> could run without any system call while in polling mode.
>>
>> Another possibility is to add a system call for single_task_running().
>> It should be simple enough that you can implement it in the vDSO and
>> avoid a context switch.  There are convenient hooking points in
>> add_nr_running and sub_nr_running.
> 
> That sounds good!
> 
> Fam
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 13:27       ` Paolo Bonzini
@ 2016-11-29 14:17         ` Fam Zheng
  2016-11-29 15:24           ` Andrew Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Fam Zheng @ 2016-11-29 14:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Davide Libenzi,
	Christoph Hellwig

On Tue, 11/29 14:27, Paolo Bonzini wrote:
> 
> 
> On 29/11/2016 14:24, Fam Zheng wrote:
> > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> >>
> >>
> >> On 29/11/2016 11:32, Fam Zheng wrote:
> >>>
> >>> The kernel change will be a new prctl operation (should it be a different
> >>> syscall to extend?) to register a new type of eventfd called "idle eventfd":
> >>>
> >>>     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
> >>>     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
> >>>
> >>> It will be notified by kernel each time when the thread's local core has no
> >>> runnable threads (i.e., entering idle state).
> >>>
> >>> QEMU can then add this eventfd to its event loop when it has events to poll, and
> >>> watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
> >>> if a ppoll() would have blocked because there are no new events, it could now
> >>> return immediately because of idle_eventfd events, and do the idle polling.
> >>
> >> This has two issues:
> >>
> >> * it only reports the leading edge of single_task_running().  Is it also
> >> useful to stop polling on the trailing edge?
> > 
> > QEMU can clear the eventfd right after event firing so I don't think it is
> > necessary.
> 
> Yes, but how would QEMU know that the eventfd has fired?  It would be
> very expensive to read the eventfd on each iteration of polling.

The idea is to ppoll() the eventfd together with other fds (ioeventfd and
linux-aio etc.), and in the handler, call event_notifier_test_and_clear()
followed by a polling loop for some period.

Fam

> 
> Paolo
> 
> >> * it still needs a system call before polling is entered.  Ideally, QEMU
> >> could run without any system call while in polling mode.
> >>
> >> Another possibility is to add a system call for single_task_running().
> >> It should be simple enough that you can implement it in the vDSO and
> >> avoid a context switch.  There are convenient hooking points in
> >> add_nr_running and sub_nr_running.
> > 
> > That sounds good!
> > 
> > Fam
> > 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 14:17         ` Fam Zheng
@ 2016-11-29 15:24           ` Andrew Jones
  2016-11-29 15:39             ` Fam Zheng
  2016-11-29 15:45             ` Paolo Bonzini
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Jones @ 2016-11-29 15:24 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Paolo Bonzini, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Davide Libenzi, Christoph Hellwig

On Tue, Nov 29, 2016 at 10:17:46PM +0800, Fam Zheng wrote:
> On Tue, 11/29 14:27, Paolo Bonzini wrote:
> > 
> > 
> > On 29/11/2016 14:24, Fam Zheng wrote:
> > > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> > >>
> > >>
> > >> On 29/11/2016 11:32, Fam Zheng wrote:
> > >>>
> > >>> The kernel change will be a new prctl operation (should it be a different
> > >>> syscall to extend?) to register a new type of eventfd called "idle eventfd":
> > >>>
> > >>>     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
> > >>>     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
> > >>>
> > >>> It will be notified by kernel each time when the thread's local core has no
> > >>> runnable threads (i.e., entering idle state).
> > >>>
> > >>> QEMU can then add this eventfd to its event loop when it has events to poll, and
> > >>> watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
> > >>> if a ppoll() would have blocked because there are no new events, it could now
> > >>> return immediately because of idle_eventfd events, and do the idle polling.
> > >>
> > >> This has two issues:
> > >>
> > >> * it only reports the leading edge of single_task_running().  Is it also
> > >> useful to stop polling on the trailing edge?
> > > 
> > > QEMU can clear the eventfd right after event firing so I don't think it is
> > > necessary.
> > 
> > Yes, but how would QEMU know that the eventfd has fired?  It would be
> > very expensive to read the eventfd on each iteration of polling.
> 
> The idea is to ppoll() the eventfd together with other fds (ioeventfd and
> linux-aio etc.), and in the handler, call event_notifier_test_and_clear()
> followed by a polling loop for some period.
> 
> Fam
> 
> > 
> > Paolo
> > 
> > >> * it still needs a system call before polling is entered.  Ideally, QEMU
> > >> could run without any system call while in polling mode.
> > >>
> > >> Another possibility is to add a system call for single_task_running().
> > >> It should be simple enough that you can implement it in the vDSO and
> > >> avoid a context switch.  There are convenient hooking points in
> > >> add_nr_running and sub_nr_running.
> > > 
> > > That sounds good!
> > > 
> > > Fam
> > > 
>

While we have a ppoll audience, another issue with the current polling
is that we can block with an infinite timeout set (-1), and it can
actually end up being infinite, i.e. vcpus will never run again. I'm
able to exhibit this with kvm-unit-tests. For these rare cases where
no other timeout has been selected, shouldn't we have a default timeout?
Anyone want to pick a number? I have a baseless compulsion to use 10 ms...

Thanks,
drew

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 15:24           ` Andrew Jones
@ 2016-11-29 15:39             ` Fam Zheng
  2016-11-29 16:01               ` Andrew Jones
  2016-11-29 15:45             ` Paolo Bonzini
  1 sibling, 1 reply; 37+ messages in thread
From: Fam Zheng @ 2016-11-29 15:39 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Paolo Bonzini, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Davide Libenzi, Christoph Hellwig

On Tue, 11/29 16:24, Andrew Jones wrote:
> On Tue, Nov 29, 2016 at 10:17:46PM +0800, Fam Zheng wrote:
> > On Tue, 11/29 14:27, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 29/11/2016 14:24, Fam Zheng wrote:
> > > > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> > > >>
> > > >>
> > > >> On 29/11/2016 11:32, Fam Zheng wrote:
> > > >>>
> > > >>> The kernel change will be a new prctl operation (should it be a different
> > > >>> syscall to extend?) to register a new type of eventfd called "idle eventfd":
> > > >>>
> > > >>>     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
> > > >>>     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
> > > >>>
> > > >>> It will be notified by kernel each time when the thread's local core has no
> > > >>> runnable threads (i.e., entering idle state).
> > > >>>
> > > >>> QEMU can then add this eventfd to its event loop when it has events to poll, and
> > > >>> watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
> > > >>> if a ppoll() would have blocked because there are no new events, it could now
> > > >>> return immediately because of idle_eventfd events, and do the idle polling.
> > > >>
> > > >> This has two issues:
> > > >>
> > > >> * it only reports the leading edge of single_task_running().  Is it also
> > > >> useful to stop polling on the trailing edge?
> > > > 
> > > > QEMU can clear the eventfd right after event firing so I don't think it is
> > > > necessary.
> > > 
> > > Yes, but how would QEMU know that the eventfd has fired?  It would be
> > > very expensive to read the eventfd on each iteration of polling.
> > 
> > The idea is to ppoll() the eventfd together with other fds (ioeventfd and
> > linux-aio etc.), and in the handler, call event_notifier_test_and_clear()
> > followed by a polling loop for some period.
> > 
> > Fam
> > 
> > > 
> > > Paolo
> > > 
> > > >> * it still needs a system call before polling is entered.  Ideally, QEMU
> > > >> could run without any system call while in polling mode.
> > > >>
> > > >> Another possibility is to add a system call for single_task_running().
> > > >> It should be simple enough that you can implement it in the vDSO and
> > > >> avoid a context switch.  There are convenient hooking points in
> > > >> add_nr_running and sub_nr_running.
> > > > 
> > > > That sounds good!
> > > > 
> > > > Fam
> > > > 
> >
> 
> While we have a ppoll audience, another issue with the current polling
> is that we can block with an infinite timeout set (-1), and it can
> actually end up being infinite, i.e. vcpus will never run again. I'm
> able to exhibit this with kvm-unit-tests.

I don't understand, why does ppoll() block vcpus? They are in different threads.
Could you elaborate the kvm-unit-test case?

> For these rare cases where
> no other timeout has been selected, shouldn't we have a default timeout?
> Anyone want to pick a number? I have a baseless compulsion to use 10 ms...

Timeout set as -1 means there is abosolutely no event to expect, waking up after
10 ms is a waste. It's a bug if guest hangs in this case.

Fam

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 15:24           ` Andrew Jones
  2016-11-29 15:39             ` Fam Zheng
@ 2016-11-29 15:45             ` Paolo Bonzini
  1 sibling, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2016-11-29 15:45 UTC (permalink / raw)
  To: Andrew Jones, Fam Zheng
  Cc: Eliezer Tamir, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Stefan Hajnoczi, Davide Libenzi,
	Christoph Hellwig



On 29/11/2016 16:24, Andrew Jones wrote:
> While we have a ppoll audience, another issue with the current polling
> is that we can block with an infinite timeout set (-1), and it can
> actually end up being infinite, i.e. vcpus will never run again. I'm
> able to exhibit this with kvm-unit-tests.

This should only happen outside the big QEMU lock, otherwise it's a bug.

Paolo

> For these rare cases where
> no other timeout has been selected, shouldn't we have a default timeout?
> Anyone want to pick a number? I have a baseless compulsion to use 10 ms...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 15:39             ` Fam Zheng
@ 2016-11-29 16:01               ` Andrew Jones
  2016-11-29 16:13                 ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Jones @ 2016-11-29 16:01 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Eliezer Tamir, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Stefan Hajnoczi, Paolo Bonzini,
	Davide Libenzi, Christoph Hellwig

On Tue, Nov 29, 2016 at 11:39:44PM +0800, Fam Zheng wrote:
> On Tue, 11/29 16:24, Andrew Jones wrote:
> > On Tue, Nov 29, 2016 at 10:17:46PM +0800, Fam Zheng wrote:
> > > On Tue, 11/29 14:27, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 29/11/2016 14:24, Fam Zheng wrote:
> > > > > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> > > > >>
> > > > >>
> > > > >> On 29/11/2016 11:32, Fam Zheng wrote:
> > > > >>>
> > > > >>> The kernel change will be a new prctl operation (should it be a different
> > > > >>> syscall to extend?) to register a new type of eventfd called "idle eventfd":
> > > > >>>
> > > > >>>     prctl(PR_ADD_IDLE_EVENTFD, int eventfd);
> > > > >>>     prctl(PR_DEL_IDLE_EVENTFD, int eventfd);
> > > > >>>
> > > > >>> It will be notified by kernel each time when the thread's local core has no
> > > > >>> runnable threads (i.e., entering idle state).
> > > > >>>
> > > > >>> QEMU can then add this eventfd to its event loop when it has events to poll, and
> > > > >>> watch virtqueue/linux-aio memory from userspace in the fd handlers.  Effectiely,
> > > > >>> if a ppoll() would have blocked because there are no new events, it could now
> > > > >>> return immediately because of idle_eventfd events, and do the idle polling.
> > > > >>
> > > > >> This has two issues:
> > > > >>
> > > > >> * it only reports the leading edge of single_task_running().  Is it also
> > > > >> useful to stop polling on the trailing edge?
> > > > > 
> > > > > QEMU can clear the eventfd right after event firing so I don't think it is
> > > > > necessary.
> > > > 
> > > > Yes, but how would QEMU know that the eventfd has fired?  It would be
> > > > very expensive to read the eventfd on each iteration of polling.
> > > 
> > > The idea is to ppoll() the eventfd together with other fds (ioeventfd and
> > > linux-aio etc.), and in the handler, call event_notifier_test_and_clear()
> > > followed by a polling loop for some period.
> > > 
> > > Fam
> > > 
> > > > 
> > > > Paolo
> > > > 
> > > > >> * it still needs a system call before polling is entered.  Ideally, QEMU
> > > > >> could run without any system call while in polling mode.
> > > > >>
> > > > >> Another possibility is to add a system call for single_task_running().
> > > > >> It should be simple enough that you can implement it in the vDSO and
> > > > >> avoid a context switch.  There are convenient hooking points in
> > > > >> add_nr_running and sub_nr_running.
> > > > > 
> > > > > That sounds good!
> > > > > 
> > > > > Fam
> > > > > 
> > >
> > 
> > While we have a ppoll audience, another issue with the current polling
> > is that we can block with an infinite timeout set (-1), and it can
> > actually end up being infinite, i.e. vcpus will never run again. I'm
> > able to exhibit this with kvm-unit-tests.
> 
> I don't understand, why does ppoll() block vcpus? They are in different threads.
> Could you elaborate the kvm-unit-test case?

OK, it may be due to scheduling then. Below is the test case (for AArch64)
Also, I forgot to mention before that I can only see this with TCG, not
KVM. If ppoll is allowed to timeout, then the test will complete. If not,
then, as can be seen with strace, the iothread is stuck in ppoll, and the
test never completes.

 #include <asm/smp.h>
 volatile int ready;
 void set_ready(void) {
     ready = 1;
     while(1);
 }
 int main(void) {
     smp_boot_secondary(1, set_ready);
     while (!ready);
     return 0;
 }

Thanks,
drew

> 
> > For these rare cases where
> > no other timeout has been selected, shouldn't we have a default timeout?
> > Anyone want to pick a number? I have a baseless compulsion to use 10 ms...
> 
> Timeout set as -1 means there is abosolutely no event to expect, waking up after
> 10 ms is a waste. It's a bug if guest hangs in this case.
> 
> Fam
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 16:01               ` Andrew Jones
@ 2016-11-29 16:13                 ` Paolo Bonzini
  2016-11-29 19:38                   ` Andrew Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2016-11-29 16:13 UTC (permalink / raw)
  To: Andrew Jones, Fam Zheng
  Cc: Eliezer Tamir, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Stefan Hajnoczi, Davide Libenzi,
	Christoph Hellwig



On 29/11/2016 17:01, Andrew Jones wrote:
> OK, it may be due to scheduling then. Below is the test case (for AArch64)
> Also, I forgot to mention before that I can only see this with TCG, not
> KVM. If ppoll is allowed to timeout, then the test will complete. If not,
> then, as can be seen with strace, the iothread is stuck in ppoll, and the
> test never completes.
> 
>  #include <asm/smp.h>
>  volatile int ready;
>  void set_ready(void) {
>      ready = 1;
>      while(1);
>  }
>  int main(void) {
>      smp_boot_secondary(1, set_ready);
>      while (!ready);
>      return 0;
>  }

Where is the test stuck?

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 16:13                 ` Paolo Bonzini
@ 2016-11-29 19:38                   ` Andrew Jones
  2016-11-30  7:19                     ` Peter Maydell
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Jones @ 2016-11-29 19:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Davide Libenzi, Christoph Hellwig

On Tue, Nov 29, 2016 at 05:13:27PM +0100, Paolo Bonzini wrote:
> 
> 
> On 29/11/2016 17:01, Andrew Jones wrote:
> > OK, it may be due to scheduling then. Below is the test case (for AArch64)
> > Also, I forgot to mention before that I can only see this with TCG, not
> > KVM. If ppoll is allowed to timeout, then the test will complete. If not,
> > then, as can be seen with strace, the iothread is stuck in ppoll, and the
> > test never completes.
> > 
> >  #include <asm/smp.h>
> >  volatile int ready;
> >  void set_ready(void) {
> >      ready = 1;
> >      while(1);
> >  }
> >  int main(void) {
> >      smp_boot_secondary(1, set_ready);
> >      while (!ready);
> >      return 0;
> >  }
> 
> Where is the test stuck?
>

Thanks for making me look, I was simply assuming we were in the while
loops above.

I couldn't get the problem to reproduce with access to the monitor,
but by adding '-d exec' I was able to see cpu0 was on the wfe in
smp_boot_secondary. It should only stay there until cpu1 executes the
sev in secondary_cinit, but it looks like TCG doesn't yet implement sev

 $ grep SEV target-arm/translate.c
        /* TODO: Implement SEV, SEVL and WFE.  May help SMP performance.

Changing the sev in kvm-unit-tests to a yield (which isn't the right
thing to do) "resolves" the issue.

Back to why the iothread's ppoll timeout is involved. Without the
timeout we never leave ppoll, so we never call qemu_mutex_lock_iothread,
which calls qemu_cpu_kick_no_halt, telling cpu1 to let cpu0 run again.

Anyway, I agree now that changing the infinite timeout to an arbitrary
finite timeout isn't the right solution. An fd that ppoll could select,
which would emulate a sched tick, would make a bit more sense, but we
probably don't want that either. Actually, this particular test case
covers such a small corner that we probably don't want to do anything
at all, except eventually implement sev for ARM TCG.

Apologies to Stefan for polluting his mail thread!

drew

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 13:24     ` Fam Zheng
  2016-11-29 13:27       ` Paolo Bonzini
@ 2016-11-29 20:43       ` Stefan Hajnoczi
  2016-11-30  5:42         ` Fam Zheng
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-29 20:43 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Paolo Bonzini, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Davide Libenzi, Christoph Hellwig

On Tue, Nov 29, 2016 at 1:24 PM, Fam Zheng <famz@redhat.com> wrote:
> On Tue, 11/29 12:17, Paolo Bonzini wrote:
>> On 29/11/2016 11:32, Fam Zheng wrote:
>> * it still needs a system call before polling is entered.  Ideally, QEMU
>> could run without any system call while in polling mode.
>>
>> Another possibility is to add a system call for single_task_running().
>> It should be simple enough that you can implement it in the vDSO and
>> avoid a context switch.  There are convenient hooking points in
>> add_nr_running and sub_nr_running.
>
> That sounds good!

With this solution QEMU can either poll virtqueues or the host kernel
can poll NIC and storage controller descriptor rings, but not both at
the same time in one thread.  This is one of the reasons why I think
exploring polling in the kernel makes more sense.

The disadvantage of the kernel approach is that you must make the
ppoll(2)/epoll_wait(2) syscall even for polling, and you probably need
to do eventfd reads afterwards so the minimum event loop iteration
latency is higher than doing polling in userspace.

Stefan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 20:43       ` Stefan Hajnoczi
@ 2016-11-30  5:42         ` Fam Zheng
  2016-11-30  9:38           ` Stefan Hajnoczi
  0 siblings, 1 reply; 37+ messages in thread
From: Fam Zheng @ 2016-11-30  5:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, Eliezer Tamir, Michael S. Tsirkin, qemu-devel,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Davide Libenzi, Christoph Hellwig

On Tue, 11/29 20:43, Stefan Hajnoczi wrote:
> On Tue, Nov 29, 2016 at 1:24 PM, Fam Zheng <famz@redhat.com> wrote:
> > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> >> On 29/11/2016 11:32, Fam Zheng wrote:
> >> * it still needs a system call before polling is entered.  Ideally, QEMU
> >> could run without any system call while in polling mode.
> >>
> >> Another possibility is to add a system call for single_task_running().
> >> It should be simple enough that you can implement it in the vDSO and
> >> avoid a context switch.  There are convenient hooking points in
> >> add_nr_running and sub_nr_running.
> >
> > That sounds good!
> 
> With this solution QEMU can either poll virtqueues or the host kernel
> can poll NIC and storage controller descriptor rings, but not both at
> the same time in one thread.  This is one of the reasons why I think
> exploring polling in the kernel makes more sense.

That's true. I have one question though: controller rings are in a different
layer in the kernel, I wonder what the syscall interface looks like to ask
kernel to poll both hardware rings and memory locations in the same loop? It's
not obvious to me after reading your eventfd patch.

> 
> The disadvantage of the kernel approach is that you must make the
> ppoll(2)/epoll_wait(2) syscall even for polling, and you probably need
> to do eventfd reads afterwards so the minimum event loop iteration
> latency is higher than doing polling in userspace.

And userspace drivers powered by dpdk or vfio will still want to do polling in
userspace anyway, we may want to take that into account as well.

Fam

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 19:38                   ` Andrew Jones
@ 2016-11-30  7:19                     ` Peter Maydell
  2016-11-30  9:05                       ` Andrew Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2016-11-30  7:19 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Paolo Bonzini, Fam Zheng, Eliezer Tamir, Michael S. Tsirkin,
	QEMU Developers, Jens Axboe, Christian Borntraeger,
	Stefan Hajnoczi, Davide Libenzi, Christoph Hellwig

On 29 November 2016 at 19:38, Andrew Jones <drjones@redhat.com> wrote:
> Thanks for making me look, I was simply assuming we were in the while
> loops above.
>
> I couldn't get the problem to reproduce with access to the monitor,
> but by adding '-d exec' I was able to see cpu0 was on the wfe in
> smp_boot_secondary. It should only stay there until cpu1 executes the
> sev in secondary_cinit, but it looks like TCG doesn't yet implement sev
>
>  $ grep SEV target-arm/translate.c
>         /* TODO: Implement SEV, SEVL and WFE.  May help SMP performance.

Yes, we currently NOP SEV. We only implement WFE as "yield back
to TCG top level loop", though, so this is fine. The idea is
that WFE gets used in busy loops so it's a helpful hint to
try running some other TCG vCPU instead of just spinning in
the guest on this one. Implementing SEV as a NOP and WFE as
a more-or-less NOP is architecturally permitted (guest code
is required to cope with WFE returning "early"). If something
is not working correctly then it's either buggy guest code
or a problem with the generic TCG scheduling of CPUs.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30  7:19                     ` Peter Maydell
@ 2016-11-30  9:05                       ` Andrew Jones
  2016-11-30  9:46                         ` Peter Maydell
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Jones @ 2016-11-30  9:05 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Fam Zheng, Eliezer Tamir, Michael S. Tsirkin, QEMU Developers,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Paolo Bonzini, Davide Libenzi, Christoph Hellwig

On Wed, Nov 30, 2016 at 07:19:12AM +0000, Peter Maydell wrote:
> On 29 November 2016 at 19:38, Andrew Jones <drjones@redhat.com> wrote:
> > Thanks for making me look, I was simply assuming we were in the while
> > loops above.
> >
> > I couldn't get the problem to reproduce with access to the monitor,
> > but by adding '-d exec' I was able to see cpu0 was on the wfe in
> > smp_boot_secondary. It should only stay there until cpu1 executes the
> > sev in secondary_cinit, but it looks like TCG doesn't yet implement sev
> >
> >  $ grep SEV target-arm/translate.c
> >         /* TODO: Implement SEV, SEVL and WFE.  May help SMP performance.
> 
> Yes, we currently NOP SEV. We only implement WFE as "yield back
> to TCG top level loop", though, so this is fine. The idea is
> that WFE gets used in busy loops so it's a helpful hint to
> try running some other TCG vCPU instead of just spinning in
> the guest on this one. Implementing SEV as a NOP and WFE as
> a more-or-less NOP is architecturally permitted (guest code
> is required to cope with WFE returning "early"). If something
> is not working correctly then it's either buggy guest code
> or a problem with the generic TCG scheduling of CPUs.

The problem is indeed with the scheduling. The way it currently works
is to depend on the iothread to kick a reschedule once in a while, or
a cpu to issue an instruction that does so (wfe/wfi). However if
there's no io and a cpu never issues a scheduling instruction, then it
won't happen. We either need a sched tick or to never have an infinite
iothread ppoll timeout (basically using the ppoll timeout as a tick).

As for being buggy guest code, I don't think so. Here's another
unit test that illustrates the issue taking wfe/sev out.

 #include <asm/smp.h>
 void secondary(void) {
     printf("secondary running\n");
     asm("yield");

     /* A "real" guest cpu shouldn't do this, but even if it
      * does, that shouldn't stop other cpus from running.
      */
     while(1);
 }
 int main(void) {
     smp_boot_secondary(1, secondary);
     printf("primary running\n");
     asm("yield");
     return 0;
 }

With that test we get the two print statements, but it never exits.

Now that I understand the problem much better, I think I may be
coming full circle and advocating the iothread's ppoll never be
allowed to have an infinite timeout again, but now only for tcg.
Something like

 if (timeout < 0 && tcg_enabled())
    timeout = TCG_SCHED_TICK;

Thanks,
drew

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30  5:42         ` Fam Zheng
@ 2016-11-30  9:38           ` Stefan Hajnoczi
  2016-11-30 10:50             ` Fam Zheng
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-30  9:38 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Stefan Hajnoczi, Paolo Bonzini, Eliezer Tamir,
	Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 2887 bytes --]

On Wed, Nov 30, 2016 at 01:42:14PM +0800, Fam Zheng wrote:
> On Tue, 11/29 20:43, Stefan Hajnoczi wrote:
> > On Tue, Nov 29, 2016 at 1:24 PM, Fam Zheng <famz@redhat.com> wrote:
> > > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> > >> On 29/11/2016 11:32, Fam Zheng wrote:
> > >> * it still needs a system call before polling is entered.  Ideally, QEMU
> > >> could run without any system call while in polling mode.
> > >>
> > >> Another possibility is to add a system call for single_task_running().
> > >> It should be simple enough that you can implement it in the vDSO and
> > >> avoid a context switch.  There are convenient hooking points in
> > >> add_nr_running and sub_nr_running.
> > >
> > > That sounds good!
> > 
> > With this solution QEMU can either poll virtqueues or the host kernel
> > can poll NIC and storage controller descriptor rings, but not both at
> > the same time in one thread.  This is one of the reasons why I think
> > exploring polling in the kernel makes more sense.
> 
> That's true. I have one question though: controller rings are in a different
> layer in the kernel, I wonder what the syscall interface looks like to ask
> kernel to poll both hardware rings and memory locations in the same loop? It's
> not obvious to me after reading your eventfd patch.

Current descriptor ring polling in select(2)/poll(2) is supported for
network sockets.  Take a look at the POLL_BUSY_LOOP flag in
fs/select.c:do_poll().  If the .poll() callback sets the flag then it
indicates that the fd supports busy loop polling.

The way this is implemented for network sockets is that the socket looks
up the napi index and is able to use the NIC driver to poll the rx ring.
Then it checks whether the socket's receive queue contains data after
the rx ring was processed.

The virtio_net.ko driver supports this interface, for example.  See
drivers/net/virtio_net.c:virtnet_busy_poll().

Busy loop polling isn't supported for block I/O yet.  There is currently
a completely independent code path for O_DIRECT synchronous I/O where
NVMe can poll for request completion.  But it doesn't work together with
asynchronous I/O (e.g. Linux AIO using eventfd with select(2)/poll(2)).

> > The disadvantage of the kernel approach is that you must make the
> > ppoll(2)/epoll_wait(2) syscall even for polling, and you probably need
> > to do eventfd reads afterwards so the minimum event loop iteration
> > latency is higher than doing polling in userspace.
> 
> And userspace drivers powered by dpdk or vfio will still want to do polling in
> userspace anyway, we may want to take that into account as well.

vfio supports interrupts so it can definitely be integrated with
adaptive kernel polling (i.e. poll for a little while and then wait for
an interrupt if there was no event).

Does dpdk ever use interrupts?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30  9:05                       ` Andrew Jones
@ 2016-11-30  9:46                         ` Peter Maydell
  2016-11-30 14:18                           ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2016-11-30  9:46 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Fam Zheng, Eliezer Tamir, Michael S. Tsirkin, QEMU Developers,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Paolo Bonzini, Davide Libenzi, Christoph Hellwig,
	Alex Bennée

On 30 November 2016 at 09:05, Andrew Jones <drjones@redhat.com> wrote:
> The problem is indeed with the scheduling. The way it currently works
> is to depend on the iothread to kick a reschedule once in a while, or
> a cpu to issue an instruction that does so (wfe/wfi). However if
> there's no io and a cpu never issues a scheduling instruction, then it
> won't happen. We either need a sched tick or to never have an infinite
> iothread ppoll timeout (basically using the ppoll timeout as a tick).

Ah yes, that one. I thought Alex had a patch which added
a timer to ensure that we don't allow a single guest
TCG vCPU to hog the execution thread, but maybe I'm
misremembering.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30  9:38           ` Stefan Hajnoczi
@ 2016-11-30 10:50             ` Fam Zheng
  2016-11-30 15:10               ` Stefan Hajnoczi
  0 siblings, 1 reply; 37+ messages in thread
From: Fam Zheng @ 2016-11-30 10:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eliezer Tamir, Michael S. Tsirkin, Stefan Hajnoczi, qemu-devel,
	Jens Axboe, Christian Borntraeger, Davide Libenzi, Paolo Bonzini,
	Christoph Hellwig

On Wed, 11/30 09:38, Stefan Hajnoczi wrote:
> On Wed, Nov 30, 2016 at 01:42:14PM +0800, Fam Zheng wrote:
> > On Tue, 11/29 20:43, Stefan Hajnoczi wrote:
> > > On Tue, Nov 29, 2016 at 1:24 PM, Fam Zheng <famz@redhat.com> wrote:
> > > > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> > > >> On 29/11/2016 11:32, Fam Zheng wrote:
> > > >> * it still needs a system call before polling is entered.  Ideally, QEMU
> > > >> could run without any system call while in polling mode.
> > > >>
> > > >> Another possibility is to add a system call for single_task_running().
> > > >> It should be simple enough that you can implement it in the vDSO and
> > > >> avoid a context switch.  There are convenient hooking points in
> > > >> add_nr_running and sub_nr_running.
> > > >
> > > > That sounds good!
> > > 
> > > With this solution QEMU can either poll virtqueues or the host kernel
> > > can poll NIC and storage controller descriptor rings, but not both at
> > > the same time in one thread.  This is one of the reasons why I think
> > > exploring polling in the kernel makes more sense.
> > 
> > That's true. I have one question though: controller rings are in a different
> > layer in the kernel, I wonder what the syscall interface looks like to ask
> > kernel to poll both hardware rings and memory locations in the same loop? It's
> > not obvious to me after reading your eventfd patch.
> 
> Current descriptor ring polling in select(2)/poll(2) is supported for
> network sockets.  Take a look at the POLL_BUSY_LOOP flag in
> fs/select.c:do_poll().  If the .poll() callback sets the flag then it
> indicates that the fd supports busy loop polling.
> 
> The way this is implemented for network sockets is that the socket looks
> up the napi index and is able to use the NIC driver to poll the rx ring.
> Then it checks whether the socket's receive queue contains data after
> the rx ring was processed.
> 
> The virtio_net.ko driver supports this interface, for example.  See
> drivers/net/virtio_net.c:virtnet_busy_poll().
> 
> Busy loop polling isn't supported for block I/O yet.  There is currently
> a completely independent code path for O_DIRECT synchronous I/O where
> NVMe can poll for request completion.  But it doesn't work together with
> asynchronous I/O (e.g. Linux AIO using eventfd with select(2)/poll(2)).

This makes perfect sense now, thanks for the pointers!

> 
> > > The disadvantage of the kernel approach is that you must make the
> > > ppoll(2)/epoll_wait(2) syscall even for polling, and you probably need
> > > to do eventfd reads afterwards so the minimum event loop iteration
> > > latency is higher than doing polling in userspace.
> > 
> > And userspace drivers powered by dpdk or vfio will still want to do polling in
> > userspace anyway, we may want to take that into account as well.
> 
> vfio supports interrupts so it can definitely be integrated with
> adaptive kernel polling (i.e. poll for a little while and then wait for
> an interrupt if there was no event).
> 
> Does dpdk ever use interrupts?

Yes, interrupt mode is supported there. For example see the intx/msix init code
in drivers/net/ixgbe/ixgbe_ethdev.c:ixgbe_dev_start().

Fam

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30  9:46                         ` Peter Maydell
@ 2016-11-30 14:18                           ` Paolo Bonzini
  2016-12-05 11:20                             ` Alex Bennée
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2016-11-30 14:18 UTC (permalink / raw)
  To: Peter Maydell, Andrew Jones
  Cc: Fam Zheng, Eliezer Tamir, Michael S. Tsirkin, QEMU Developers,
	Jens Axboe, Christian Borntraeger, Stefan Hajnoczi,
	Davide Libenzi, Christoph Hellwig, Alex Bennée



On 30/11/2016 10:46, Peter Maydell wrote:
>> > The problem is indeed with the scheduling. The way it currently works
>> > is to depend on the iothread to kick a reschedule once in a while, or
>> > a cpu to issue an instruction that does so (wfe/wfi). However if
>> > there's no io and a cpu never issues a scheduling instruction, then it
>> > won't happen. We either need a sched tick or to never have an infinite
>> > iothread ppoll timeout (basically using the ppoll timeout as a tick).
> Ah yes, that one. I thought Alex had a patch which added
> a timer to ensure that we don't allow a single guest
> TCG vCPU to hog the execution thread, but maybe I'm
> misremembering.

Yes, it's part of MTTCG.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30 10:50             ` Fam Zheng
@ 2016-11-30 15:10               ` Stefan Hajnoczi
  0 siblings, 0 replies; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-30 15:10 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Eliezer Tamir, Michael S. Tsirkin, Stefan Hajnoczi, qemu-devel,
	Jens Axboe, Christian Borntraeger, Davide Libenzi, Paolo Bonzini,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 3915 bytes --]

On Wed, Nov 30, 2016 at 06:50:09PM +0800, Fam Zheng wrote:
> On Wed, 11/30 09:38, Stefan Hajnoczi wrote:
> > On Wed, Nov 30, 2016 at 01:42:14PM +0800, Fam Zheng wrote:
> > > On Tue, 11/29 20:43, Stefan Hajnoczi wrote:
> > > > On Tue, Nov 29, 2016 at 1:24 PM, Fam Zheng <famz@redhat.com> wrote:
> > > > > On Tue, 11/29 12:17, Paolo Bonzini wrote:
> > > > >> On 29/11/2016 11:32, Fam Zheng wrote:
> > > > >> * it still needs a system call before polling is entered.  Ideally, QEMU
> > > > >> could run without any system call while in polling mode.
> > > > >>
> > > > >> Another possibility is to add a system call for single_task_running().
> > > > >> It should be simple enough that you can implement it in the vDSO and
> > > > >> avoid a context switch.  There are convenient hooking points in
> > > > >> add_nr_running and sub_nr_running.
> > > > >
> > > > > That sounds good!
> > > > 
> > > > With this solution QEMU can either poll virtqueues or the host kernel
> > > > can poll NIC and storage controller descriptor rings, but not both at
> > > > the same time in one thread.  This is one of the reasons why I think
> > > > exploring polling in the kernel makes more sense.
> > > 
> > > That's true. I have one question though: controller rings are in a different
> > > layer in the kernel, I wonder what the syscall interface looks like to ask
> > > kernel to poll both hardware rings and memory locations in the same loop? It's
> > > not obvious to me after reading your eventfd patch.
> > 
> > Current descriptor ring polling in select(2)/poll(2) is supported for
> > network sockets.  Take a look at the POLL_BUSY_LOOP flag in
> > fs/select.c:do_poll().  If the .poll() callback sets the flag then it
> > indicates that the fd supports busy loop polling.
> > 
> > The way this is implemented for network sockets is that the socket looks
> > up the napi index and is able to use the NIC driver to poll the rx ring.
> > Then it checks whether the socket's receive queue contains data after
> > the rx ring was processed.
> > 
> > The virtio_net.ko driver supports this interface, for example.  See
> > drivers/net/virtio_net.c:virtnet_busy_poll().
> > 
> > Busy loop polling isn't supported for block I/O yet.  There is currently
> > a completely independent code path for O_DIRECT synchronous I/O where
> > NVMe can poll for request completion.  But it doesn't work together with
> > asynchronous I/O (e.g. Linux AIO using eventfd with select(2)/poll(2)).
> 
> This makes perfect sense now, thanks for the pointers!
> 
> > 
> > > > The disadvantage of the kernel approach is that you must make the
> > > > ppoll(2)/epoll_wait(2) syscall even for polling, and you probably need
> > > > to do eventfd reads afterwards so the minimum event loop iteration
> > > > latency is higher than doing polling in userspace.
> > > 
> > > And userspace drivers powered by dpdk or vfio will still want to do polling in
> > > userspace anyway, we may want to take that into account as well.
> > 
> > vfio supports interrupts so it can definitely be integrated with
> > adaptive kernel polling (i.e. poll for a little while and then wait for
> > an interrupt if there was no event).
> > 
> > Does dpdk ever use interrupts?
> 
> Yes, interrupt mode is supported there. For example see the intx/msix init code
> in drivers/net/ixgbe/ixgbe_ethdev.c:ixgbe_dev_start().

If the application wants to poll 100% of the time then userspace polling
is the right solution.  Userspace polling also makes sense when all
event sources can be polled from userspace - it's faster than using
kernel polling.

But I think in adaptive polling + wait use cases or when there are a
mixture of userspace and kernel event sources to poll, then it makes
sense to use kernel polling.

In QEMU we have the latter so I think we need to contribute to kernel
polling.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-29 10:45       ` Stefan Hajnoczi
@ 2016-11-30 17:41         ` Avi Kivity
  2016-12-01 11:45           ` Stefan Hajnoczi
  0 siblings, 1 reply; 37+ messages in thread
From: Avi Kivity @ 2016-11-30 17:41 UTC (permalink / raw)
  To: Stefan Hajnoczi, Paolo Bonzini
  Cc: Willem de Bruijn, Fam Zheng, Eliezer Tamir, Eric Dumazet,
	Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig



On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote:
> On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote:
>> On 28/11/2016 16:29, Stefan Hajnoczi wrote:
>>> Thanks for sharing the link.  I'll let you know before embarking on an
>>> effort to make epoll support busy_loop.
>>>
>>> At the moment I'm still evaluating whether the good results we've gotten
>>> from polling in QEMU userspace are preserved when polling is shifted to
>>> the kernel.
>>>
>>> FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
>>> chance to test it yet:
>>> https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13
>> This would add a system call every time the main loop processes a vring,
>> wouldn't it?
> Yes, this is a problem and is the reason I haven't finished implementing
> a test using QEMU yet.
>
> My proposed eventfd polling mechanism doesn't work well with descriptor
> ring indices because the polling info needs to be updated each event
> loop iteration with the last seen ring index.
>
> This can be solved by making struct eventfd_poll_info.value take a
> userspace memory address.  The value to compare against is fetched each
> polling iteration, avoiding the need for ioctl calls.
>

Maybe we could do the same for sockets?  When data is available on a 
socket (or when it becomes writable), write to a user memory location.

I, too, have an interest in polling; in my situation most of the polling 
happens in userspace.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30 17:41         ` Avi Kivity
@ 2016-12-01 11:45           ` Stefan Hajnoczi
  2016-12-01 11:59             ` Avi Kivity
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-12-01 11:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Paolo Bonzini, Willem de Bruijn, Fam Zheng, Eliezer Tamir,
	Eric Dumazet, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1752 bytes --]

On Wed, Nov 30, 2016 at 07:41:11PM +0200, Avi Kivity wrote:
> On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote:
> > On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote:
> > > On 28/11/2016 16:29, Stefan Hajnoczi wrote:
> > > > Thanks for sharing the link.  I'll let you know before embarking on an
> > > > effort to make epoll support busy_loop.
> > > > 
> > > > At the moment I'm still evaluating whether the good results we've gotten
> > > > from polling in QEMU userspace are preserved when polling is shifted to
> > > > the kernel.
> > > > 
> > > > FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
> > > > chance to test it yet:
> > > > https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13
> > > This would add a system call every time the main loop processes a vring,
> > > wouldn't it?
> > Yes, this is a problem and is the reason I haven't finished implementing
> > a test using QEMU yet.
> > 
> > My proposed eventfd polling mechanism doesn't work well with descriptor
> > ring indices because the polling info needs to be updated each event
> > loop iteration with the last seen ring index.
> > 
> > This can be solved by making struct eventfd_poll_info.value take a
> > userspace memory address.  The value to compare against is fetched each
> > polling iteration, avoiding the need for ioctl calls.
> > 
> 
> Maybe we could do the same for sockets?  When data is available on a socket
> (or when it becomes writable), write to a user memory location.
> 
> I, too, have an interest in polling; in my situation most of the polling
> happens in userspace.

You are trying to improve on the latency of non-blocking
ppoll(2)/epoll_wait(2) call?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-12-01 11:45           ` Stefan Hajnoczi
@ 2016-12-01 11:59             ` Avi Kivity
  2016-12-01 14:35               ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Avi Kivity @ 2016-12-01 11:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, Willem de Bruijn, Fam Zheng, Eliezer Tamir,
	Eric Dumazet, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig

On 12/01/2016 01:45 PM, Stefan Hajnoczi wrote:
> On Wed, Nov 30, 2016 at 07:41:11PM +0200, Avi Kivity wrote:
>> On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote:
>>> On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote:
>>>> On 28/11/2016 16:29, Stefan Hajnoczi wrote:
>>>>> Thanks for sharing the link.  I'll let you know before embarking on an
>>>>> effort to make epoll support busy_loop.
>>>>>
>>>>> At the moment I'm still evaluating whether the good results we've gotten
>>>>> from polling in QEMU userspace are preserved when polling is shifted to
>>>>> the kernel.
>>>>>
>>>>> FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
>>>>> chance to test it yet:
>>>>> https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13
>>>> This would add a system call every time the main loop processes a vring,
>>>> wouldn't it?
>>> Yes, this is a problem and is the reason I haven't finished implementing
>>> a test using QEMU yet.
>>>
>>> My proposed eventfd polling mechanism doesn't work well with descriptor
>>> ring indices because the polling info needs to be updated each event
>>> loop iteration with the last seen ring index.
>>>
>>> This can be solved by making struct eventfd_poll_info.value take a
>>> userspace memory address.  The value to compare against is fetched each
>>> polling iteration, avoiding the need for ioctl calls.
>>>
>> Maybe we could do the same for sockets?  When data is available on a socket
>> (or when it becomes writable), write to a user memory location.
>>
>> I, too, have an interest in polling; in my situation most of the polling
>> happens in userspace.
> You are trying to improve on the latency of non-blocking
> ppoll(2)/epoll_wait(2) call?
>

Yes, but the concern is for throughput, not latency.


My main loop looks like

    execute some tasks
    poll from many sources

Since epoll_wait(..., 0) has non-trivial costs, I have to limit the 
polling rate, and even so it shows up in the profile.  If the cost were 
lower, I could poll at a higher frequency, resulting in lower worst-case 
latencies for high-priority tasks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-12-01 11:59             ` Avi Kivity
@ 2016-12-01 14:35               ` Paolo Bonzini
  2016-12-02 10:12                 ` Stefan Hajnoczi
  2016-12-07 10:32                 ` Avi Kivity
  0 siblings, 2 replies; 37+ messages in thread
From: Paolo Bonzini @ 2016-12-01 14:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Stefan Hajnoczi, Willem de Bruijn, Fam Zheng, Eliezer Tamir,
	Eric Dumazet, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig


> > > Maybe we could do the same for sockets?  When data is available on a
> > > socket (or when it becomes writable), write to a user memory location.
> > >
> > > I, too, have an interest in polling; in my situation most of the polling
> > > happens in userspace.
> >
> > You are trying to improve on the latency of non-blocking
> > ppoll(2)/epoll_wait(2) call?
> 
> Yes, but the concern is for throughput, not latency.
> 
> My main loop looks like
> 
>     execute some tasks
>     poll from many sources
> 
> Since epoll_wait(..., 0) has non-trivial costs, I have to limit the
> polling rate, and even so it shows up in the profile.  If the cost were
> lower, I could poll at a higher frequency, resulting in lower worst-case
> latencies for high-priority tasks.

IMHO, the ideal model wouldn't enter the kernel at all unless you _want_
to go to sleep.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-12-01 14:35               ` Paolo Bonzini
@ 2016-12-02 10:12                 ` Stefan Hajnoczi
  2016-12-07 10:38                   ` Avi Kivity
  2016-12-07 10:32                 ` Avi Kivity
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-12-02 10:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Avi Kivity, Willem de Bruijn, Fam Zheng, Eliezer Tamir,
	Eric Dumazet, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Stefan Hajnoczi, Davide Libenzi,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1424 bytes --]

On Thu, Dec 01, 2016 at 09:35:27AM -0500, Paolo Bonzini wrote:
> > > > Maybe we could do the same for sockets?  When data is available on a
> > > > socket (or when it becomes writable), write to a user memory location.
> > > >
> > > > I, too, have an interest in polling; in my situation most of the polling
> > > > happens in userspace.
> > >
> > > You are trying to improve on the latency of non-blocking
> > > ppoll(2)/epoll_wait(2) call?
> > 
> > Yes, but the concern is for throughput, not latency.
> > 
> > My main loop looks like
> > 
> >     execute some tasks
> >     poll from many sources
> > 
> > Since epoll_wait(..., 0) has non-trivial costs, I have to limit the
> > polling rate, and even so it shows up in the profile.  If the cost were
> > lower, I could poll at a higher frequency, resulting in lower worst-case
> > latencies for high-priority tasks.
> 
> IMHO, the ideal model wouldn't enter the kernel at all unless you _want_
> to go to sleep.

Something like mmapping an epoll file descriptor to get a ring of
epoll_events.  It has to play nice with blocking epoll_wait(2) and
ppoll(2) of an eventpoll file descriptor so that the userspace process
can go to sleep without a lot of work to switch epoll wait vs mmap
modes.

I'm bothered by the fact that the kernel will not be able to poll NIC or
storage controller rings if userspace is doing the polling :(.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-11-30 14:18                           ` Paolo Bonzini
@ 2016-12-05 11:20                             ` Alex Bennée
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Bennée @ 2016-12-05 11:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, Andrew Jones, Fam Zheng, Eliezer Tamir,
	Michael S. Tsirkin, QEMU Developers, Jens Axboe,
	Christian Borntraeger, Stefan Hajnoczi, Davide Libenzi,
	Christoph Hellwig


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 30/11/2016 10:46, Peter Maydell wrote:
>>> > The problem is indeed with the scheduling. The way it currently works
>>> > is to depend on the iothread to kick a reschedule once in a while, or
>>> > a cpu to issue an instruction that does so (wfe/wfi). However if
>>> > there's no io and a cpu never issues a scheduling instruction, then it
>>> > won't happen. We either need a sched tick or to never have an infinite
>>> > iothread ppoll timeout (basically using the ppoll timeout as a tick).
>> Ah yes, that one. I thought Alex had a patch which added
>> a timer to ensure that we don't allow a single guest
>> TCG vCPU to hog the execution thread, but maybe I'm
>> misremembering.
>
> Yes, it's part of MTTCG.

The patch itself is:

    tcg: add kick timer for single-threaded vCPU emulation

It's not really part of MTTCG as it can be applied without reference to
the MTTCG work. However it is a pre-requisite to the iothread mutex
rework that MTTCG requires which would otherwise break the single
threaded case which currently relies on the side-effect to trigger
scheduling.

>
> Paolo


--
Alex Bennée

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-12-01 14:35               ` Paolo Bonzini
  2016-12-02 10:12                 ` Stefan Hajnoczi
@ 2016-12-07 10:32                 ` Avi Kivity
  1 sibling, 0 replies; 37+ messages in thread
From: Avi Kivity @ 2016-12-07 10:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Willem de Bruijn, Fam Zheng, Eliezer Tamir,
	Eric Dumazet, Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Davide Libenzi, Christoph Hellwig



On 12/01/2016 04:35 PM, Paolo Bonzini wrote:
>>>> Maybe we could do the same for sockets?  When data is available on a
>>>> socket (or when it becomes writable), write to a user memory location.
>>>>
>>>> I, too, have an interest in polling; in my situation most of the polling
>>>> happens in userspace.
>>> You are trying to improve on the latency of non-blocking
>>> ppoll(2)/epoll_wait(2) call?
>> Yes, but the concern is for throughput, not latency.
>>
>> My main loop looks like
>>
>>      execute some tasks
>>      poll from many sources
>>
>> Since epoll_wait(..., 0) has non-trivial costs, I have to limit the
>> polling rate, and even so it shows up in the profile.  If the cost were
>> lower, I could poll at a higher frequency, resulting in lower worst-case
>> latencies for high-priority tasks.
> IMHO, the ideal model wouldn't enter the kernel at all unless you _want_
> to go to sleep.

Right.  Note that if data is available, it doesn't really matter, 
because you'd have to read() it anyway.  It matters for the case where 
data is not available.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] Linux kernel polling for QEMU
  2016-12-02 10:12                 ` Stefan Hajnoczi
@ 2016-12-07 10:38                   ` Avi Kivity
  0 siblings, 0 replies; 37+ messages in thread
From: Avi Kivity @ 2016-12-07 10:38 UTC (permalink / raw)
  To: Stefan Hajnoczi, Paolo Bonzini
  Cc: Willem de Bruijn, Fam Zheng, Eliezer Tamir, Eric Dumazet,
	Michael S. Tsirkin, qemu-devel, Jens Axboe,
	Christian Borntraeger, Stefan Hajnoczi, Davide Libenzi,
	Christoph Hellwig



On 12/02/2016 12:12 PM, Stefan Hajnoczi wrote:
> On Thu, Dec 01, 2016 at 09:35:27AM -0500, Paolo Bonzini wrote:
>>>>> Maybe we could do the same for sockets?  When data is available on a
>>>>> socket (or when it becomes writable), write to a user memory location.
>>>>>
>>>>> I, too, have an interest in polling; in my situation most of the polling
>>>>> happens in userspace.
>>>> You are trying to improve on the latency of non-blocking
>>>> ppoll(2)/epoll_wait(2) call?
>>> Yes, but the concern is for throughput, not latency.
>>>
>>> My main loop looks like
>>>
>>>      execute some tasks
>>>      poll from many sources
>>>
>>> Since epoll_wait(..., 0) has non-trivial costs, I have to limit the
>>> polling rate, and even so it shows up in the profile.  If the cost were
>>> lower, I could poll at a higher frequency, resulting in lower worst-case
>>> latencies for high-priority tasks.
>> IMHO, the ideal model wouldn't enter the kernel at all unless you _want_
>> to go to sleep.
> Something like mmapping an epoll file descriptor to get a ring of
> epoll_events.  It has to play nice with blocking epoll_wait(2) and
> ppoll(2) of an eventpoll file descriptor so that the userspace process
> can go to sleep without a lot of work to switch epoll wait vs mmap
> modes.

I think that's overkill; as soon as data is returned, you're going to be 
talking to the kernel in order to get it.  Besides, leave something for 
userspace tcp stack implementations.

> I'm bothered by the fact that the kernel will not be able to poll NIC or
> storage controller rings if userspace is doing the polling :(.


Yes, that's a concern.

We have three conflicting requirements:
  - fast inter-thread IPC (or guest/host IPC) -> shared memory+polling
  - kernel multiplexing of hardware -> kernel interfaces + kernel poll / 
sleep
  - composable interfaces that allow you to use the two methods

You could use something like Flow Director to move interesting flows 
directly to the application, let the application poll those queues (and 
maybe tell the kernel if something interesting happens), while the 
kernel manages the other queues.  It gets ugly very quickly.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2016-12-07 10:39 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-24 15:12 [Qemu-devel] Linux kernel polling for QEMU Stefan Hajnoczi
2016-11-28  9:31 ` Eliezer Tamir
2016-11-28 15:29   ` Stefan Hajnoczi
2016-11-28 15:41     ` Paolo Bonzini
2016-11-29 10:45       ` Stefan Hajnoczi
2016-11-30 17:41         ` Avi Kivity
2016-12-01 11:45           ` Stefan Hajnoczi
2016-12-01 11:59             ` Avi Kivity
2016-12-01 14:35               ` Paolo Bonzini
2016-12-02 10:12                 ` Stefan Hajnoczi
2016-12-07 10:38                   ` Avi Kivity
2016-12-07 10:32                 ` Avi Kivity
2016-11-28 20:41   ` Willem de Bruijn
2016-11-29  8:19 ` Christian Borntraeger
2016-11-29 11:00   ` Stefan Hajnoczi
2016-11-29 11:58     ` Christian Borntraeger
2016-11-29 10:32 ` Fam Zheng
2016-11-29 11:17   ` Paolo Bonzini
2016-11-29 13:24     ` Fam Zheng
2016-11-29 13:27       ` Paolo Bonzini
2016-11-29 14:17         ` Fam Zheng
2016-11-29 15:24           ` Andrew Jones
2016-11-29 15:39             ` Fam Zheng
2016-11-29 16:01               ` Andrew Jones
2016-11-29 16:13                 ` Paolo Bonzini
2016-11-29 19:38                   ` Andrew Jones
2016-11-30  7:19                     ` Peter Maydell
2016-11-30  9:05                       ` Andrew Jones
2016-11-30  9:46                         ` Peter Maydell
2016-11-30 14:18                           ` Paolo Bonzini
2016-12-05 11:20                             ` Alex Bennée
2016-11-29 15:45             ` Paolo Bonzini
2016-11-29 20:43       ` Stefan Hajnoczi
2016-11-30  5:42         ` Fam Zheng
2016-11-30  9:38           ` Stefan Hajnoczi
2016-11-30 10:50             ` Fam Zheng
2016-11-30 15:10               ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.