On Fri, Apr 05, 2019 at 06:29:49PM +0200, Sergio Lopez wrote:
> 
> Stefan Hajnoczi writes:
> 
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> >
> >   https://github.com/stefanha/qemu/commits/event-loop-optimizations
> >
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> >
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> >
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> >
> > If you have time to pick up this work, try benchmarking each commit
> > individually so that performance changes can be attributed to each one.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
> 
> It took me a while to find a way to get meaningful numbers to evaluate
> those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
> 7351P) the cost of event_notifier_set() is just ~0.4us when the code
> path is hot, and it's hard to differentiate it from the noise.
> 
> To do so, I've used a patched kernel with a naive io_poll implementation
> for virtio_blk [1], an also-patched QEMU with poll-inflight [2] (just to
> be sure we're polling), and ran the test on semi-isolated cores
> (nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
> storage is simulated by null_blk with "completion_nsec=0 no_sched=1
> irqmode=0".
> 
>   # fio --time_based --runtime=30 --rw=randread --name=randread \
>         --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 \
>         --hipri=1
> 
> | avg_lat (us) | master | qbsn* |
> | run1         | 11.32  | 10.96 |
> | run2         | 11.37  | 10.79 |
> | run3         | 11.42  | 10.67 |
> | run4         | 11.32  | 11.06 |
> | run5         | 11.42  | 11.19 |
> | run6         | 11.42  | 10.91 |
> 
> * patched with "aio: add optimized qemu_bh_schedule_nested() API"
> 
> Even though there's still some variance in the numbers, the ~0.4us
> improvement can be clearly appreciated.
> 
> I haven't tested the other 3 patches, as their optimizations only take
> effect when the event loop is not running in polling mode. Without
> polling we get an additional overhead of at least 10us, plus a lot of
> noise, due to both direct costs (ppoll()...) and indirect ones
> (re-scheduling and TLB/cache pollution), so I don't think we can
> reliably benchmark them. Their impact probably won't be significant
> either, due to the costs I've just mentioned.

Thanks for benchmarking them.  We can leave them for now, since there is
a risk of introducing bugs and they don't make a great difference.

Stefan

> Sergio.
> 
> [1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html