From: Stefan Hajnoczi <stefanha@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Kevin Wolf" <kwolf@redhat.com>,
	"Tingting Mao" <timao@redhat.com>,
	qemu-devel@nongnu.org, "Honghao Wang" <wanghonghao@bytedance.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Daniele Buono" <dbuono@linux.vnet.ibm.com>,
	"Serge Guelton" <sguelton@redhat.com>
Subject: Re: [PATCH] coroutine: resize pool periodically instead of limiting size
Date: Thu, 9 Sep 2021 16:51:44 +0100
Message-ID: <YTotkCiuqTeDgJJ0@stefanha-x1.localdomain>
In-Reply-To: <YTnHwJ/0O4rk7M7g@redhat.com>

On Thu, Sep 09, 2021 at 09:37:20AM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > It was reported that enabling SafeStack reduces IOPS significantly
> > (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> > block device:
> > 
> >   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> > 	--filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> > 	--group_reporting --numjobs=16 --time_based \
> >         --output=/tmp/fio_result
> > 
> > Serge Guelton and I found that SafeStack is not really at fault; it just
> > increases the cost of coroutine creation. This fio workload exhausts the
> > coroutine pool and coroutine creation becomes a bottleneck. Previous
> > work by Honghao Wang also pointed to excessive coroutine creation.
> > 
> > Creating new coroutines is expensive due to allocating new stacks with
> > mmap(2) and mprotect(2). Currently there are thread-local and global
> > pools that recycle old Coroutine objects and their stacks but the
> > hardcoded size limit of 64 for thread-local pools and 128 for the global
> > pool is insufficient for the fio benchmark shown above.
> 
> Rather than keeping around a thread local pool of coroutine
> instances, did you ever consider keeping around a pool of
> allocated stacks? Essentially it seems like you're saying
> the stack allocation is the problem due to it using mmap()
> instead of malloc() and thus not benefiting from any of the
> performance tricks malloc() impls use to avoid repeated
> syscalls on every allocation. If qemu_alloc_stack() and
> qemu_free_stack() could be made more intelligent by caching
> stacks, then perhaps the coroutine side could be left "dumb"?

What is the advantage of doing that? Then the Coroutine struct needs to
be malloced each time. Coroutines are the only users of
qemu_alloc_stack(), so I think pooling the Coroutines is optimal.
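
For reference, the slow path the pool avoids looks roughly like this. A
simplified sketch of an mmap(2)-based guard-page stack allocator along the
lines of qemu_alloc_stack() (illustrative only; the real code differs in
details):

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Simplified sketch of guard-page stack allocation, roughly what
     * qemu_alloc_stack() does. Two syscalls per new coroutine: mmap(2)
     * reserves the memory and mprotect(2) arms the guard page. */
    static void *alloc_stack(size_t sz)
    {
        size_t pagesz = sysconf(_SC_PAGESIZE);
        void *ptr = mmap(NULL, sz + pagesz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (ptr == MAP_FAILED) {
            abort();
        }

        /* Make the lowest page inaccessible so a stack overflow faults
         * instead of silently corrupting adjacent memory. */
        if (mprotect(ptr, pagesz, PROT_NONE) != 0) {
            abort();
        }

        return (char *)ptr + pagesz; /* usable stack starts above the guard */
    }

Both syscalls are paid for every coroutine created outside the pool, which
is why recycling whole Coroutine objects pays off.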

> > 
> > This patch changes the coroutine pool algorithm to a simple thread-local
> > pool without a size limit. Threads periodically shrink the pool down to
> > a size sufficient for the maximum observed number of coroutines.
> > 
> > This is a very simple algorithm. Fancier things could be done like
> > keeping a minimum number of coroutines around to avoid latency when a
> > new coroutine is created after a long period of inactivity. Another
> > thought is to stop the timer when the pool size is zero for power saving
> > on threads that aren't using coroutines. However, I'd rather not add
> > bells and whistles unless they are really necessary.
> > 
> > The global pool is removed by this patch. It can help to hide the fact
> > that local pools are easily exhausted, but it doesn't fix the root
> > cause. I don't think there is a need for a global pool because QEMU's
> > threads are long-lived, so let's keep things simple.
> 
> QEMU's threads may be long-lived, but are the workloads they service
> expected to be consistent over time?
> 
> eg can we ever get a situation where Thread A has a peak of activity
> causing caching of many coroutines, and then go idle for a long time,
> while Thread B then has a peak and is unable to take advantage of the
> cache from Thread A that is now unused ?

It's possible to trigger that case, but it's probably not a real-world
scenario. It requires a storage controller that is emulated in the vCPU
thread (like AHCI) and a workload that alternates between vCPUs.
However, that configuration isn't expected to perform well anyway
(virtio-blk and virtio-scsi use IOThreads instead of running emulation
in vCPU threads because this reduces the cost of the vmexit).
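
To make the new algorithm concrete, here is a minimal sketch of the
per-thread bookkeeping (coroutine_get/coroutine_put/pool_resize and the
variable names are illustrative, not the patch's actual code):

    #include <stddef.h>

    typedef struct Coroutine Coroutine;
    struct Coroutine {
        Coroutine *pool_next;              /* free-list link */
        /* ... stack pointer, entry point, etc. ... */
    };

    Coroutine *coroutine_new(void);        /* slow path: mmap + mprotect */
    void coroutine_delete(Coroutine *co);  /* munmap stack, free struct */

    /* Thread-local pool state; no size limit on the free list. */
    static __thread Coroutine *pool;
    static __thread size_t pool_size;      /* current free-list length */
    static __thread size_t in_use;         /* live coroutines right now */
    static __thread size_t max_in_use;     /* peak since last timer tick */

    Coroutine *coroutine_get(void)
    {
        Coroutine *co = pool;

        if (co) {
            pool = co->pool_next;
            pool_size--;
        } else {
            co = coroutine_new();
        }
        if (++in_use > max_in_use) {
            max_in_use = in_use;           /* record observed maximum */
        }
        return co;
    }

    void coroutine_put(Coroutine *co)
    {
        in_use--;
        co->pool_next = pool;              /* always recycle, never free here */
        pool = co;
        pool_size++;
    }

    /* Periodic per-thread timer callback: shrink the pool down to a size
     * sufficient for the peak observed since the last tick, then restart
     * the measurement. */
    void pool_resize(void)
    {
        while (pool_size > max_in_use) {
            Coroutine *co = pool;

            pool = co->pool_next;
            pool_size--;
            coroutine_delete(co);
        }
        max_in_use = in_use;
    }

The point is that the fast paths stay lock-free and unbounded, and all
freeing happens in the timer callback.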

Stefan
