* Re: [Qemu-devel] QEMU event loop optimizations
From: Sergio Lopez @ 2019-04-05 16:29 UTC
To: Stefan Hajnoczi; +Cc: Sergio Lopez, qemu-devel, Paolo Bonzini

Stefan Hajnoczi writes:

> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

It took me a while to find a way to get meaningful numbers to evaluate
those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
7351P) the cost of event_notifier_set() is just ~0.4us when the code
path is hot, and it's hard to differentiate it from the noise.

To do so, I've used a patched kernel with a naive io_poll implementation
for virtio_blk [1], a QEMU also patched with poll-inflight [2] (just to
be sure we're polling), and ran the test on semi-isolated cores
(nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
storage is simulated by null_blk with "completion_nsec=0 no_sched=1
irqmode=0".

# fio --time_based --runtime=30 --rw=randread --name=randread \
  --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1

| avg_lat (us) | master | qbsn* |
| run1         | 11.32  | 10.96 |
| run2         | 11.37  | 10.79 |
| run3         | 11.42  | 10.67 |
| run4         | 11.32  | 11.06 |
| run5         | 11.42  | 11.19 |
| run6         | 11.42  | 10.91 |

* patched with "aio: add optimized qemu_bh_schedule_nested() API"

Even though there's still some variance in the numbers, the 0.4us
improvement can be clearly appreciated.

I haven't tested the other 3 patches, as their optimizations only have
an effect when the event loop is not running in polling mode. Without
polling, we get an additional overhead of at least 10us, plus a lot of
noise, due to both direct costs (ppoll()...) and indirect ones
(re-scheduling and TLB/cache pollution), so I don't think we can
reliably benchmark them. Their impact probably won't be significant
either, due to the costs I've just mentioned.

Sergio.

[1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
[2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html
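The qemu_bh_schedule_nested() patch referenced in the table lives on
Stefan's branch linked above. As a rough sketch of the idea it tests --
based on the qemu_bh_schedule() of that era in util/async.c, not the
actual patch, with the function name and shape assumed for illustration:

    /* Sketch only.  qemu_bh_schedule() marked the BH scheduled and then
     * woke the event loop:
     *
     *     if (atomic_xchg(&bh->scheduled, 1) == 0) {
     *         aio_notify(bh->ctx);
     *     }
     *
     * A "nested" variant, callable only from a handler already running
     * inside the event loop, can drop the aio_notify() wakeup -- and
     * with it the ~0.4us event_notifier_set() measured above -- because
     * aio_bh_poll() runs again before the loop can block. */
    static void qemu_bh_schedule_nested_sketch(QEMUBH *bh)
    {
        bh->idle = 0;
        /* atomic_xchg still provides the barrier that orders the
         * caller's writes before aio_bh_poll() reads them. */
        atomic_xchg(&bh->scheduled, 1);
    }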
* Re: [Qemu-devel] QEMU event loop optimizations
From: Stefan Hajnoczi @ 2019-04-08  8:29 UTC
To: Sergio Lopez; +Cc: qemu-devel, Paolo Bonzini

On Fri, Apr 05, 2019 at 06:29:49PM +0200, Sergio Lopez wrote:
>
> Stefan Hajnoczi writes:
>
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> >
> > https://github.com/stefanha/qemu/commits/event-loop-optimizations
> >
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> >
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> >
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> >
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
>
> It took me a while to find a way to get meaningful numbers to evaluate
> those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
> 7351P) the cost of event_notifier_set() is just ~0.4us when the code
> path is hot, and it's hard to differentiate it from the noise.
>
> To do so, I've used a patched kernel with a naive io_poll implementation
> for virtio_blk [1], a QEMU also patched with poll-inflight [2] (just to
> be sure we're polling), and ran the test on semi-isolated cores
> (nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
> storage is simulated by null_blk with "completion_nsec=0 no_sched=1
> irqmode=0".
>
> # fio --time_based --runtime=30 --rw=randread --name=randread \
>   --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1
>
> | avg_lat (us) | master | qbsn* |
> | run1         | 11.32  | 10.96 |
> | run2         | 11.37  | 10.79 |
> | run3         | 11.42  | 10.67 |
> | run4         | 11.32  | 11.06 |
> | run5         | 11.42  | 11.19 |
> | run6         | 11.42  | 10.91 |
>
> * patched with "aio: add optimized qemu_bh_schedule_nested() API"
>
> Even though there's still some variance in the numbers, the 0.4us
> improvement can be clearly appreciated.
>
> I haven't tested the other 3 patches, as their optimizations only have
> an effect when the event loop is not running in polling mode. Without
> polling, we get an additional overhead of at least 10us, plus a lot of
> noise, due to both direct costs (ppoll()...) and indirect ones
> (re-scheduling and TLB/cache pollution), so I don't think we can
> reliably benchmark them. Their impact probably won't be significant
> either, due to the costs I've just mentioned.

Thanks for benchmarking them.  We can leave them for now, since there
is a risk of introducing bugs and they don't make a great difference.

Stefan

> Sergio.
>
> [1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html
[parent not found: <55751c00-0854-ea4d-75b5-ab82b4eeb70d@redhat.com>]
* Re: [Qemu-devel] QEMU event loop optimizations
From: Kevin Wolf @ 2019-04-02 16:18 UTC
To: Paolo Bonzini; +Cc: Stefan Hajnoczi, Sergio Lopez, qemu-devel

On 26.03.2019 at 15:11, Paolo Bonzini wrote:
> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
>   possible to have just one LinuxAioState per AioContext, and then
>   it can simply share the AioContext's EventNotifier.  This removes
>   the need to do the event_notifier_test_and_clear in linux-aio.c.

Isn't having only one LinuxAioState per AioContext what we already do?
See aio_get_linux_aio().

Kevin
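The sharing Kevin points at is the lazy per-context initialization
behind aio_get_linux_aio(); a simplified sketch of the pattern (the
real QEMU code splits setup from lookup and handles laio_init()
failures, so names and signatures here are approximations):

    /* Simplified sketch of per-AioContext LinuxAioState sharing. */
    LinuxAioState *aio_get_linux_aio_sketch(AioContext *ctx)
    {
        if (!ctx->linux_aio) {
            /* First user in this context: create one io_context_t... */
            ctx->linux_aio = laio_init();
            /* ...and hook its completion eventfd into this event loop. */
            laio_attach_aio_context(ctx->linux_aio, ctx);
        }
        /* Every disk driven from this AioContext shares the instance. */
        return ctx->linux_aio;
    }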
* Re: [Qemu-devel] QEMU event loop optimizations
From: Paolo Bonzini @ 2019-04-02 16:25 UTC
To: Kevin Wolf; +Cc: Stefan Hajnoczi, Sergio Lopez, qemu-devel

On 02/04/19 18:18, Kevin Wolf wrote:
> On 26.03.2019 at 15:11, Paolo Bonzini wrote:
>> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
>>   possible to have just one LinuxAioState per AioContext, and then
>>   it can simply share the AioContext's EventNotifier.  This removes
>>   the need to do the event_notifier_test_and_clear in linux-aio.c.
>
> Isn't having only one LinuxAioState per AioContext what we already do?
> See aio_get_linux_aio().

And I should have known that:

    commit 0187f5c9cb172771ba85c66e3bf61f8cde6d6561
    Author: Paolo Bonzini <pbonzini@redhat.com>
    Date:   Mon Jul 4 18:33:20 2016 +0200

        linux-aio: share one LinuxAioState within an AioContext

        This has better performance because it executes fewer system calls
        and does not use a bottom half per disk.

        Originally proposed by Ming Lei.

The second point, which is to share the AioContext's EventNotifier,
still stands.

Paolo
* Re: [Qemu-devel] QEMU event loop optimizations
From: Sergio Lopez @ 2019-04-05 16:33 UTC
To: Paolo Bonzini; +Cc: Stefan Hajnoczi, qemu-devel

Paolo Bonzini writes:

> On 26/03/19 14:18, Stefan Hajnoczi wrote:
>> Hi Sergio,
>> Here are the forgotten event loop optimizations I mentioned:
>>
>> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>>
>> The goal was to eliminate or reorder syscalls so that useful work (like
>> executing BHs) occurs as soon as possible after an event is detected.
>>
>> I remember that these optimizations only shave off a handful of
>> microseconds, so they aren't a huge win.  They do become attractive on
>> fast SSDs with <10us read/write latency.
>>
>> These optimizations are aggressive and there is a possibility of
>> introducing regressions.
>>
>> If you have time to pick up this work, try benchmarking each commit
>> individually so performance changes are attributed individually.
>> There's no need to send them together in a single patch series, the
>> changes are quite independent.
>
> I reviewed the patches now:
>
> - qemu_bh_schedule_nested should not be necessary since we have
>   ctx->notify_me to also avoid the event_notifier_set call.  However, it
>   is possible to avoid the smp_mb at the beginning of aio_notify, since
>   atomic_xchg already implies it.  Maybe add a "static void
>   aio_notify__after_smp_mb"?

try_poll_mode() is called with ctx->notify_me != 0, so we get at least
one event_notifier_set() call while working in polling mode.

Sergio.
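For context, the gate Sergio refers to sits in aio_notify(); a
simplified sketch of the util/async.c code of that era (barrier
comments abbreviated, ctx->notified update omitted):

    /* Simplified sketch: ctx->notify_me is non-zero whenever the event
     * loop is blocked in ppoll() or busy-polling in try_poll_mode(), so
     * scheduling a BH during polling still pays for an
     * event_notifier_set() syscall -- Sergio's point above. */
    void aio_notify_sketch(AioContext *ctx)
    {
        /* Order e.g. bh->scheduled writes before reading notify_me;
         * pairs with a barrier on the event-loop side. */
        smp_mb();
        if (ctx->notify_me) {
            event_notifier_set(&ctx->notifier);
        }
    }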
* Re: [Qemu-devel] QEMU event loop optimizations
From: Paolo Bonzini @ 2019-04-08 10:42 UTC
To: Sergio Lopez; +Cc: Stefan Hajnoczi, qemu-devel

On 05/04/19 18:33, Sergio Lopez wrote:
>> - qemu_bh_schedule_nested should not be necessary since we have
>>   ctx->notify_me to also avoid the event_notifier_set call.  However, it
>>   is possible to avoid the smp_mb at the beginning of aio_notify, since
>>   atomic_xchg already implies it.  Maybe add a "static void
>>   aio_notify__after_smp_mb"?
>
> try_poll_mode() is called with ctx->notify_me != 0, so we get at least
> one event_notifier_set() call while working in polling mode.

Right, though then there is the idea of making ctx->notify_me and
ctx->notified bitmasks, where bit 0 is "BH ready" and bit 1 is "poll()
parameters changed".  try_poll_mode() cares about the latter, but does
not need event_notifier_set() for BH ready.

Paolo
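One hypothetical shape for that bitmask idea -- none of these flag
names or this helper exist in QEMU, and memory barriers are elided;
they are invented here purely to illustrate the split Paolo describes:

    #define AIO_NOTIFY_BH_READY      1u  /* bit 0: a bottom half is ready    */
    #define AIO_NOTIFY_POLL_CHANGED  2u  /* bit 1: poll() parameters changed */

    /* Hypothetical: wake the loop only if it declared interest in this
     * class of event.  A loop inside try_poll_mode() would advertise
     * AIO_NOTIFY_POLL_CHANGED but not AIO_NOTIFY_BH_READY in notify_me,
     * so scheduling a BH would skip the event_notifier_set() that
     * Sergio measured at ~0.4us. */
    static void aio_notify_flags(AioContext *ctx, unsigned flags)
    {
        atomic_or(&ctx->notified, flags);
        if (atomic_read(&ctx->notify_me) & flags) {
            event_notifier_set(&ctx->notifier);
        }
    }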
end of thread, other threads: [~2019-04-08 10:43 UTC | newest]

Thread overview: 10+ messages -- links below jump to the message on this page --
     [not found] <20190326131822.GD15011@stefanha-x1.localdomain>
2019-04-05 16:29 ` [Qemu-devel] QEMU event loop optimizations Sergio Lopez
2019-04-05 16:29   ` Sergio Lopez
2019-04-08  8:29   ` Stefan Hajnoczi
2019-04-08  8:29     ` Stefan Hajnoczi
     [not found] ` <55751c00-0854-ea4d-75b5-ab82b4eeb70d@redhat.com>
2019-04-02 16:18   ` Kevin Wolf
2019-04-02 16:25     ` Paolo Bonzini
2019-04-05 16:33       ` Sergio Lopez
2019-04-05 16:33         ` Sergio Lopez
2019-04-08 10:42           ` Paolo Bonzini
2019-04-08 10:42             ` Paolo Bonzini