* Re: [Qemu-devel] QEMU event loop optimizations
[not found] ` <55751c00-0854-ea4d-75b5-ab82b4eeb70d@redhat.com>
@ 2019-04-02 16:18 ` Kevin Wolf
2019-04-02 16:25 ` Paolo Bonzini
2019-04-05 16:33 ` Sergio Lopez
1 sibling, 1 reply; 10+ messages in thread
From: Kevin Wolf @ 2019-04-02 16:18 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Stefan Hajnoczi, Sergio Lopez, qemu-devel
On 26.03.2019 at 15:11, Paolo Bonzini wrote:
> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
> possible to have just one LinuxAioState per AioContext, and then
> it can simply share the AioContext's EventNotifier. This removes
> the need to do the event_notifier_test_and_clear in linux-aio.c.
Isn't having only one LinuxAioState per AioContext what we already do?
See aio_get_linux_aio().
Kevin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Qemu-devel] QEMU event loop optimizations
2019-04-02 16:18 ` Kevin Wolf
@ 2019-04-02 16:25 ` Paolo Bonzini
0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2019-04-02 16:25 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Stefan Hajnoczi, Sergio Lopez, qemu-devel
On 02/04/19 18:18, Kevin Wolf wrote:
> On 26.03.2019 at 15:11, Paolo Bonzini wrote:
>> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
>> possible to have just one LinuxAioState per AioContext, and then
>> it can simply share the AioContext's EventNotifier. This removes
>> the need to do the event_notifier_test_and_clear in linux-aio.c.
>
> Isn't having only one LinuxAioState per AioContext what we already do?
> See aio_get_linux_aio().
And I should have known that:
commit 0187f5c9cb172771ba85c66e3bf61f8cde6d6561
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Mon Jul 4 18:33:20 2016 +0200
linux-aio: share one LinuxAioState within an AioContext
This has better performance because it executes fewer system calls
and does not use a bottom half per disk.
Originally proposed by Ming Lei.
The second point, which is to share the AioContext's EventNotifier,
still stands.
Paolo
* Re: [Qemu-devel] QEMU event loop optimizations
@ 2019-04-05 16:29 ` Sergio Lopez
0 siblings, 0 replies; 10+ messages in thread
From: Sergio Lopez @ 2019-04-05 16:29 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Sergio Lopez, qemu-devel, Paolo Bonzini
Stefan Hajnoczi writes:
> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win. They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.
It took me a while to find a way to get meaningful numbers to evaluate
those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
7351P) the cost of event_notifier_set() is just ~0.4us when the code
path is hot, and it's hard to differentiate it from the noise.
To do so, I've used a patched kernel with a naive io_poll implementation
for virtio_blk [1] and a QEMU patched with poll-inflight [2] (just to
be sure we're polling), and ran the test on semi-isolated cores
(nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
storage is simulated by null_blk with "completion_nsec=0 no_sched=1
irqmode=0".
# fio --time_based --runtime=30 --rw=randread --name=randread \
--filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1
| avg_lat (us) | master | qbsn* |
| run1         | 11.32  | 10.96 |
| run2         | 11.37  | 10.79 |
| run3         | 11.42  | 10.67 |
| run4         | 11.32  | 11.06 |
| run5         | 11.42  | 11.19 |
| run6         | 11.42  | 10.91 |
* patched with aio: add optimized qemu_bh_schedule_nested() API
Even though there's still some variance in the numbers, the 0.4us
improvement can be clearly observed.
I haven't tested the other 3 patches, as their optimizations only take
effect when the event loop is not running in polling mode. Without
polling, we get at least 10us of additional overhead, plus a lot of
noise, due to both direct costs (ppoll()...) and indirect ones
(re-scheduling and TLB/cache pollution), so I don't think we can
reliably benchmark them. Their impact probably won't be significant
either, due to the costs I've just mentioned.
Sergio.
[1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
[2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html
* Re: [Qemu-devel] QEMU event loop optimizations
@ 2019-04-05 16:33 ` Sergio Lopez
0 siblings, 0 replies; 10+ messages in thread
From: Sergio Lopez @ 2019-04-05 16:33 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Stefan Hajnoczi, qemu-devel
Paolo Bonzini writes:
> On 26/03/19 14:18, Stefan Hajnoczi wrote:
>> Hi Sergio,
>> Here are the forgotten event loop optimizations I mentioned:
>>
>> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>>
>> The goal was to eliminate or reorder syscalls so that useful work (like
>> executing BHs) occurs as soon as possible after an event is detected.
>>
>> I remember that these optimizations only shave off a handful of
>> microseconds, so they aren't a huge win. They do become attractive on
>> fast SSDs with <10us read/write latency.
>>
>> These optimizations are aggressive and there is a possibility of
>> introducing regressions.
>>
>> If you have time to pick up this work, try benchmarking each commit
>> individually so performance changes are attributed individually.
>> There's no need to send them together in a single patch series, the
>> changes are quite independent.
>
> I reviewed the patches now:
>
> - qemu_bh_schedule_nested should not be necessary since we have
> ctx->notify_me to also avoid the event_notifier_set call. However, it
> is possible to avoid the smp_mb at the beginning of aio_notify, since
> atomic_xchg already implies it. Maybe add a "static void
> aio_notify__after_smp_mb"?
try_poll_mode() is called with ctx->notify_me != 0, so we get at least
one event_notifier_set() call while working in polling mode.
Sergio.
* Re: [Qemu-devel] QEMU event loop optimizations
@ 2019-04-08 8:29 ` Stefan Hajnoczi
0 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-04-08 8:29 UTC (permalink / raw)
To: Sergio Lopez; +Cc: qemu-devel, Paolo Bonzini
On Fri, Apr 05, 2019 at 06:29:49PM +0200, Sergio Lopez wrote:
>
> Stefan Hajnoczi writes:
>
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> >
> > https://github.com/stefanha/qemu/commits/event-loop-optimizations
> >
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> >
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win. They do become attractive on
> > fast SSDs with <10us read/write latency.
> >
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> >
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
>
> It took me a while to find a way to get meaningful numbers to evaluate
> those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
> 7351P) the cost of event_notifier_set() is just ~0.4us when the code
> path is hot, and it's hard to differentiate it from the noise.
>
> To do so, I've used a patched kernel with a naive io_poll implementation
> for virtio_blk [1] and a QEMU patched with poll-inflight [2] (just to
> be sure we're polling), and ran the test on semi-isolated cores
> (nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
> storage is simulated by null_blk with "completion_nsec=0 no_sched=1
> irqmode=0".
>
> # fio --time_based --runtime=30 --rw=randread --name=randread \
> --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1
>
> | avg_lat (us) | master | qbsn* |
> | run1         | 11.32  | 10.96 |
> | run2         | 11.37  | 10.79 |
> | run3         | 11.42  | 10.67 |
> | run4         | 11.32  | 11.06 |
> | run5         | 11.42  | 11.19 |
> | run6         | 11.42  | 10.91 |
> * patched with aio: add optimized qemu_bh_schedule_nested() API
>
> Even though there's still some variance in the numbers, the 0.4us
> improvement can be clearly observed.
>
> I haven't tested the other 3 patches, as their optimizations only take
> effect when the event loop is not running in polling mode. Without
> polling, we get at least 10us of additional overhead, plus a lot of
> noise, due to both direct costs (ppoll()...) and indirect ones
> (re-scheduling and TLB/cache pollution), so I don't think we can
> reliably benchmark them. Their impact probably won't be significant
> either, due to the costs I've just mentioned.
Thanks for benchmarking them. We can leave them for now, since there is
a risk of introducing bugs and they don't make a great difference.
Stefan
> Sergio.
>
> [1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html
* Re: [Qemu-devel] QEMU event loop optimizations
@ 2019-04-08 10:42 ` Paolo Bonzini
0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2019-04-08 10:42 UTC (permalink / raw)
To: Sergio Lopez; +Cc: Stefan Hajnoczi, qemu-devel
On 05/04/19 18:33, Sergio Lopez wrote:
>> - qemu_bh_schedule_nested should not be necessary since we have
>> ctx->notify_me to also avoid the event_notifier_set call. However, it
>> is possible to avoid the smp_mb at the beginning of aio_notify, since
>> atomic_xchg already implies it. Maybe add a "static void
>> aio_notify__after_smp_mb"?
> try_poll_mode() is called with ctx->notify_me != 0, so we get at least
> one event_notifier_set() call while working in polling mode.
Right, though then there is the idea of making ctx->notify_me and
ctx->notified bitmasks, where bit 0 is "BH ready" and bit 1 is "poll()
parameters changed". try_poll_mode() cares about the latter, but does
not need event_notifier_set() for BH ready.
Paolo
end of thread, other threads:[~2019-04-08 10:43 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20190326131822.GD15011@stefanha-x1.localdomain>
2019-04-05 16:29 ` [Qemu-devel] QEMU event loop optimizations Sergio Lopez
2019-04-05 16:29 ` Sergio Lopez
2019-04-08 8:29 ` Stefan Hajnoczi
2019-04-08 8:29 ` Stefan Hajnoczi
[not found] ` <55751c00-0854-ea4d-75b5-ab82b4eeb70d@redhat.com>
2019-04-02 16:18 ` Kevin Wolf
2019-04-02 16:25 ` Paolo Bonzini
2019-04-05 16:33 ` Sergio Lopez
2019-04-05 16:33 ` Sergio Lopez
2019-04-08 10:42 ` Paolo Bonzini
2019-04-08 10:42 ` Paolo Bonzini