On Thu, Sep 09, 2021 at 09:37:20AM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > It was reported that enabling SafeStack reduces IOPS significantly
> > (>25%) with the following fio benchmark on virtio-blk using an NVMe
> > host block device:
> >
> >   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> >         --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> >         --group_reporting --numjobs=16 --time_based \
> >         --output=/tmp/fio_result
> >
> > Serge Guelton and I found that SafeStack is not really at fault, it
> > just increases the cost of coroutine creation. This fio workload
> > exhausts the coroutine pool and coroutine creation becomes a
> > bottleneck. Previous work by Honghao Wang also pointed to excessive
> > coroutine creation.
> >
> > Creating new coroutines is expensive due to allocating new stacks
> > with mmap(2) and mprotect(2). Currently there are thread-local and
> > global pools that recycle old Coroutine objects and their stacks but
> > the hardcoded size limit of 64 for thread-local pools and 128 for
> > the global pool is insufficient for the fio benchmark shown above.
>
> Rather than keeping around a thread local pool of coroutine
> instances, did you ever consider keeping around a pool of
> allocated stacks ? Essentially it seems like you're saying
> the stack allocation is the problem due to it using mmap()
> instead of malloc() and thus not benefiting from any of the
> performance tricks malloc() impls use to avoid repeated
> syscalls on every allocation. If 'qemu_alloc_stack' and
> qemu_free_stack could be made more intelligent by caching
> stacks, then perhaps the coroutine side can be left "dumb" ?

What is the advantage of doing that? Then the Coroutine struct needs to
be malloced each time. Coroutines are the only users of
qemu_alloc_stack(), so I think pooling the Coroutines is optimal.

> > This patch changes the coroutine pool algorithm to a simple
> > thread-local pool without a size limit. Threads periodically shrink
> > the pool down to a size sufficient for the maximum observed number
> > of coroutines.
> >
> > This is a very simple algorithm. Fancier things could be done like
> > keeping a minimum number of coroutines around to avoid latency when
> > a new coroutine is created after a long period of inactivity.
> > Another thought is to stop the timer when the pool size is zero for
> > power saving on threads that aren't using coroutines. However, I'd
> > rather not add bells and whistles unless they are really necessary.
> >
> > The global pool is removed by this patch. It can help to hide the
> > fact that local pools are easily exhausted, but it doesn't fix the
> > root cause. I don't think there is a need for a global pool because
> > QEMU's threads are long-lived, so let's keep things simple.
>
> QEMU's threads may be long-lived, but are the workloads they service
> expected to be consistent over time?
>
> eg can we ever get a situation where Thread A has a peak of activity
> causing caching of many coroutines, and then go idle for a long time,
> while Thread B then has a peak and is unable to take advantage of the
> cache from Thread A that is now unused ?

It's possible to trigger that case, but it's probably not a real-world
scenario. It requires a storage controller that is emulated in the vCPU
thread (like AHCI) and a workload that alternates between vCPUs.
However, that configuration isn't expected to perform well anyway
(virtio-blk and virtio-scsi use IOThreads instead of running emulation
in vCPU threads because this reduces the cost of the vmexit).
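For illustration, here is a rough standalone sketch of the thread-local
pool with periodic shrinking described above. The names, the fixed 1 MB
stack size, and the use of malloc(3) in place of the mmap(2)-based
qemu_alloc_stack() are simplifications for this sketch, not details of
the actual patch:

/* Standalone sketch: per-thread coroutine pool, no size limit, with a
 * periodic shrink step driven by the peak usage of the last period.
 * All names here are invented for illustration only.
 */
#include <stdlib.h>

typedef struct Coroutine Coroutine;
struct Coroutine {
    Coroutine *next;   /* free-list link while sitting in the pool */
    void *stack;       /* stack stays allocated while pooled */
};

/* Thread-local pool state */
static __thread Coroutine *pool;      /* free list of idle coroutines */
static __thread size_t pool_size;     /* number of idle coroutines */
static __thread size_t alive;         /* coroutines currently in use */
static __thread size_t max_alive;     /* peak usage in this period */

static Coroutine *coroutine_alloc(void)
{
    Coroutine *co = calloc(1, sizeof(*co));
    co->stack = malloc(1 << 20);  /* stand-in for qemu_alloc_stack() */
    return co;
}

static void coroutine_free(Coroutine *co)
{
    free(co->stack);              /* stand-in for qemu_free_stack() */
    free(co);
}

/* Get a coroutine, reusing a pooled one (and its stack) when possible */
Coroutine *co_pool_get(void)
{
    Coroutine *co = pool;

    if (co) {
        pool = co->next;
        pool_size--;
    } else {
        co = coroutine_alloc();   /* slow path */
    }
    alive++;
    if (alive > max_alive) {
        max_alive = alive;
    }
    return co;
}

/* Return a finished coroutine to the pool; no upper bound is enforced */
void co_pool_put(Coroutine *co)
{
    alive--;
    co->next = pool;
    pool = co;
    pool_size++;
}

/* Called periodically (e.g. from a timer): free pooled coroutines beyond
 * what the peak usage of the last period could need, then restart the
 * measurement for the next period.
 */
void co_pool_periodic_shrink(void)
{
    while (pool && pool_size + alive > max_alive) {
        Coroutine *co = pool;

        pool = co->next;
        pool_size--;
        coroutine_free(co);
    }
    max_alive = alive;
}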

Stefan