* [PATCH 1/2] QSLIST: add atomic replace operation
@ 2020-08-24  4:31 wanghonghao
  2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
  0 siblings, 2 replies; 11+ messages in thread
From: wanghonghao @ 2020-08-24  4:31 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, fam, wanghonghao, stefanha

Replace a queue with another atomically. This is useful when we need to
transfer queues between threads.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 include/qemu/queue.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 456a5b01ee..a3ff544193 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -226,6 +226,10 @@ struct {                                                \
     (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);              \
 } while (/*CONSTCOND*/0)

+#define QSLIST_REPLACE_ATOMIC(dest, src) do {                              \
+    (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first);  \
+} while (/*CONSTCOND*/0)
+
 #define QSLIST_REMOVE_HEAD(head, field) do {                               \
     typeof((head)->slh_first) elm = (head)->slh_first;                     \
     (head)->slh_first = elm->field.sle_next;                               \
--
2.24.3 (Apple Git-128)
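[Editor's note: not QEMU code — a minimal standalone sketch of what the macro above does, using C11 stdatomic in place of QEMU's atomic_xchg. The struct and function names are illustrative only.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

struct slist_head {
    struct node *_Atomic first;
};

/* The shape of QSLIST_REPLACE_ATOMIC(dest, src): the head of dest is
 * swapped in a single atomic exchange, and the displaced chain is then
 * written back into src as a separate, second step.  The two lists are
 * therefore NOT exchanged in one atomic operation: dest may be shared
 * between threads, but src must be private to the calling thread. */
static void slist_replace_atomic(struct slist_head *dest,
                                 struct slist_head *src)
{
    src->first = atomic_exchange(&dest->first, src->first);
}
```

After the call, the two chains have traded heads, which is what lets a thread hand a whole batch to a shared structure while taking back whatever was there before.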
* [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-24  4:31 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
@ 2020-08-24  4:31 ` wanghonghao
  2020-08-25 14:52   ` Stefan Hajnoczi
  0 siblings, 1 reply; 11+ messages in thread
From: wanghonghao @ 2020-08-24  4:31 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, fam, wanghonghao, stefanha

This patch replaces the global coroutine queue with a lock-free stack
whose elements are coroutine queues. Threads can put coroutine queues
into the stack or take queues from it, and each coroutine queue holds
exactly POOL_BATCH_SIZE coroutines. Note that the stack is not strictly
LIFO, but that is good enough for a buffer pool.

On release, coroutines are put into thread-local pools first. Now the
fast paths of both allocation and release are atomic-free, and not too
many coroutines will remain in a single thread since POOL_BATCH_SIZE
has been reduced to 16.

In practice, I've run a VM with two block devices bound to two
different iothreads, and run fio with iodepth 128 on each device.
Without this patch it maintains around 400 coroutines and has about a
1% chance of calling `qemu_coroutine_new`. With this patch it maintains
no more than 273 coroutines and doesn't call `qemu_coroutine_new` after
the initial allocations.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 util/qemu-coroutine.c | 63 ++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index c3caa6c770..070d492edc 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -21,13 +21,14 @@
 #include "block/aio.h"

 enum {
-    POOL_BATCH_SIZE = 64,
+    POOL_BATCH_SIZE = 16,
+    POOL_MAX_BATCHES = 32,
 };

-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
-static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+/** Free stack to speed up creation */
+static QSLIST_HEAD(, Coroutine) pool[POOL_MAX_BATCHES];
+static int pool_top;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool;
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;

@@ -49,20 +50,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            int top;
+
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            }
+
+            while ((top = atomic_read(&pool_top)) > 0) {
+                if (atomic_cmpxchg(&pool_top, top, top - 1) != top) {
+                    continue;
                 }
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &pool[top - 1]);
                 co = QSLIST_FIRST(&alloc_pool);
+
+                if (co) {
+                    alloc_pool_size = POOL_BATCH_SIZE;
+                    break;
+                }
             }
         }
         if (co) {
@@ -86,16 +93,30 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;

     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
+        int top, value, old;
+
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;
             return;
         }
+
+        for (top = atomic_read(&pool_top); top < POOL_MAX_BATCHES; top++) {
+            QSLIST_REPLACE_ATOMIC(&pool[top], &alloc_pool);
+            if (!QSLIST_EMPTY(&alloc_pool)) {
+                continue;
+            }
+
+            value = top + 1;
+
+            do {
+                old = atomic_cmpxchg(&pool_top, top, value);
+            } while (old != top && (top = old) < value);
+
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size = 1;
+            return;
+        }
     }

     qemu_coroutine_delete(co);
--
2.24.3 (Apple Git-128)
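[Editor's note: to make the pop/push protocol above easier to follow, here is a standalone sketch of the same batch-stack idea, with C11 stdatomic in place of QEMU's atomic_* helpers and a plain `long` standing in for a whole coroutine batch. All names are illustrative, not QEMU API.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_BATCHES 32

static _Atomic int stack_top;            /* number of published slots */
static _Atomic long slots[MAX_BATCHES];  /* 0 = empty, nonzero = a "batch" */

/* Pop: reserve a slot index by lowering stack_top with a CAS (the
 * atomic_cmpxchg(&pool_top, top, top - 1) step in the patch), then take
 * the whole batch out of that slot with one atomic exchange. */
static long stack_pop(void)
{
    int top;
    while ((top = atomic_load(&stack_top)) > 0) {
        int expected = top;
        if (!atomic_compare_exchange_strong(&stack_top, &expected, top - 1)) {
            continue;  /* lost the race; reread the top */
        }
        long batch = atomic_exchange(&slots[top - 1], 0);
        if (batch) {
            return batch;  /* took exactly one batch */
        }
        /* Slot was empty (a concurrent push had not landed yet); retry. */
    }
    return 0;  /* stack empty: caller falls back to allocating */
}

/* Push: swap the batch into a slot (the QSLIST_REPLACE_ATOMIC step);
 * if the slot was occupied, carry the displaced batch to the next slot.
 * Then raise stack_top, but never move it backwards past a value some
 * other thread has already published. */
static bool stack_push(long batch)
{
    for (int top = atomic_load(&stack_top); top < MAX_BATCHES; top++) {
        batch = atomic_exchange(&slots[top], batch);
        if (batch) {
            continue;  /* slot taken; try the next one with the evicted batch */
        }
        int value = top + 1, expected = top;
        while (!atomic_compare_exchange_strong(&stack_top, &expected, value) &&
               expected < value) {
            /* retry with the freshly observed top left in `expected` */
        }
        return true;
    }
    return false;  /* stack full: caller frees the batch */
}
```

This also shows why the stack is "not strictly LIFO": a pop can observe an empty slot while a concurrent push is still in flight, and a push may land above the slot a pop just reserved.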
* Re: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
@ 2020-08-25 14:52   ` Stefan Hajnoczi
  2020-08-26  6:06     ` [External] " 王洪浩
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-08-25 14:52 UTC (permalink / raw)
To: wanghonghao; +Cc: kwolf, pbonzini, fam, qemu-devel

On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> This patch replaces the global coroutine queue with a lock-free stack
> whose elements are coroutine queues. Threads can put coroutine queues
> into the stack or take queues from it, and each coroutine queue holds
> exactly POOL_BATCH_SIZE coroutines. Note that the stack is not strictly
> LIFO, but that is good enough for a buffer pool.
>
> On release, coroutines are put into thread-local pools first. Now the
> fast paths of both allocation and release are atomic-free, and not too
> many coroutines will remain in a single thread since POOL_BATCH_SIZE
> has been reduced to 16.
>
> In practice, I've run a VM with two block devices bound to two
> different iothreads, and run fio with iodepth 128 on each device.
> Without this patch it maintains around 400 coroutines and has about a
> 1% chance of calling `qemu_coroutine_new`. With this patch it
> maintains no more than 273 coroutines and doesn't call
> `qemu_coroutine_new` after the initial allocations.

Does throughput or IOPS change?

Is the main purpose of this patch to reduce memory consumption?

Stefan
* Re: [External] Re: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-25 14:52 ` Stefan Hajnoczi
@ 2020-08-26  6:06   ` 王洪浩
  2020-09-29  3:24     ` PING: " 王洪浩
  0 siblings, 1 reply; 11+ messages in thread
From: 王洪浩 @ 2020-08-26  6:06 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, fam, qemu-devel

The purpose of this patch is to improve performance without increasing
memory consumption.

My test case: QEMU command line arguments

-drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \
-device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \
-drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \
-device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \

Run these two fio jobs at the same time:

[job-vda]
filename=/dev/vda
iodepth=64
ioengine=libaio
rw=randrw
bs=4k
size=300G
rwmixread=80
direct=1
numjobs=2
runtime=60

[job-vdb]
filename=/dev/vdb
iodepth=64
ioengine=libaio
rw=randrw
bs=4k
size=300G
rwmixread=90
direct=1
numjobs=2
loops=1
runtime=60

Without this patch, tested 3 times:
total IOPS: 278548.1, 312374.1, 276638.2
With this patch, tested 3 times:
total IOPS: 368370.9, 335693.2, 327693.1

An 18.9% improvement on average.

In addition, we are also using a distributed block storage whose I/O
latency is much higher than that of local NVMe devices because of the
network overhead, so it needs a higher iodepth (>= 256) to reach its
maximum throughput. Without this patch it has more than a 5% chance of
calling `qemu_coroutine_new` and the IOPS is less than 100K, while the
IOPS is about 260K with this patch.

On the other hand, a simpler way to reduce or eliminate the cost of
`qemu_coroutine_new` is to increase POOL_BATCH_SIZE, but that would
also bring much more memory consumption, which we don't want. Hence
this patch.

Stefan Hajnoczi <stefanha@redhat.com> wrote on Tue, 25 Aug 2020 at 22:52:
>
> On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> > This patch replaces the global coroutine queue with a lock-free stack
> > whose elements are coroutine queues. Threads can put coroutine queues
> > into the stack or take queues from it, and each coroutine queue holds
> > exactly POOL_BATCH_SIZE coroutines. Note that the stack is not strictly
> > LIFO, but that is good enough for a buffer pool.
> >
> > On release, coroutines are put into thread-local pools first. Now the
> > fast paths of both allocation and release are atomic-free, and not too
> > many coroutines will remain in a single thread since POOL_BATCH_SIZE
> > has been reduced to 16.
> >
> > In practice, I've run a VM with two block devices bound to two
> > different iothreads, and run fio with iodepth 128 on each device.
> > Without this patch it maintains around 400 coroutines and has about a
> > 1% chance of calling `qemu_coroutine_new`. With this patch it
> > maintains no more than 273 coroutines and doesn't call
> > `qemu_coroutine_new` after the initial allocations.
>
> Does throughput or IOPS change?
>
> Is the main purpose of this patch to reduce memory consumption?
>
> Stefan
* PING: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-26  6:06 ` [External] " 王洪浩
@ 2020-09-29  3:24   ` 王洪浩
  2020-10-13 10:04     ` Stefan Hajnoczi
  0 siblings, 1 reply; 11+ messages in thread
From: 王洪浩 @ 2020-09-29  3:24 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, fam, qemu-devel

Hi, I'd like to know if there are any other problems with this patch,
or whether there is a better implementation for improving the coroutine
pool.

王洪浩 <wanghonghao@bytedance.com> wrote on Wed, 26 Aug 2020 at 14:06:
>
> The purpose of this patch is to improve performance without increasing
> memory consumption.
>
> My test case: QEMU command line arguments
>
> -drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \
> -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \
> -drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \
> -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \
>
> Run these two fio jobs at the same time:
>
> [job-vda]
> filename=/dev/vda
> iodepth=64
> ioengine=libaio
> rw=randrw
> bs=4k
> size=300G
> rwmixread=80
> direct=1
> numjobs=2
> runtime=60
>
> [job-vdb]
> filename=/dev/vdb
> iodepth=64
> ioengine=libaio
> rw=randrw
> bs=4k
> size=300G
> rwmixread=90
> direct=1
> numjobs=2
> loops=1
> runtime=60
>
> Without this patch, tested 3 times:
> total IOPS: 278548.1, 312374.1, 276638.2
> With this patch, tested 3 times:
> total IOPS: 368370.9, 335693.2, 327693.1
>
> An 18.9% improvement on average.
>
> In addition, we are also using a distributed block storage whose I/O
> latency is much higher than that of local NVMe devices because of the
> network overhead, so it needs a higher iodepth (>= 256) to reach its
> maximum throughput. Without this patch it has more than a 5% chance of
> calling `qemu_coroutine_new` and the IOPS is less than 100K, while the
> IOPS is about 260K with this patch.
>
> On the other hand, a simpler way to reduce or eliminate the cost of
> `qemu_coroutine_new` is to increase POOL_BATCH_SIZE, but that would
> also bring much more memory consumption, which we don't want. Hence
> this patch.
>
> Stefan Hajnoczi <stefanha@redhat.com> wrote on Tue, 25 Aug 2020 at 22:52:
> >
> > Does throughput or IOPS change?
> >
> > Is the main purpose of this patch to reduce memory consumption?
> >
> > Stefan
* Re: PING: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-09-29  3:24 ` PING: " 王洪浩
@ 2020-10-13 10:04   ` Stefan Hajnoczi
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-10-13 10:04 UTC (permalink / raw)
To: 王洪浩; +Cc: kwolf, pbonzini, fam, qemu-devel

On Tue, Sep 29, 2020 at 11:24:14AM +0800, 王洪浩 wrote:
> Hi, I'd like to know if there are any other problems with this patch,
> or whether there is a better implementation for improving the coroutine
> pool.

Please rebase onto qemu.git/master and resend the patch as a top-level
email thread. I think v2 was overlooked because it was sent as a reply:
https://wiki.qemu.org/Contribute/SubmitAPatch

Thanks,
Stefan
* Re: [PATCH 1/2] QSLIST: add atomic replace operation
  2020-08-24  4:31 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
  2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
@ 2020-08-24 15:26 ` Stefan Hajnoczi
  2020-08-25  3:33   ` [External] " 王洪浩
  2020-08-25  3:37   ` [PATCH v2 " wanghonghao
  1 sibling, 2 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-08-24 15:26 UTC (permalink / raw)
To: wanghonghao; +Cc: kwolf, pbonzini, fam, qemu-devel

On Mon, Aug 24, 2020 at 12:31:20PM +0800, wanghonghao wrote:
> Replace a queue with another atomically. This is useful when we need to
> transfer queues between threads.
>
> Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
> ---
>  include/qemu/queue.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/qemu/queue.h b/include/qemu/queue.h
> index 456a5b01ee..a3ff544193 100644
> --- a/include/qemu/queue.h
> +++ b/include/qemu/queue.h
> @@ -226,6 +226,10 @@ struct {                                                \
>      (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);              \
>  } while (/*CONSTCOND*/0)
>
> +#define QSLIST_REPLACE_ATOMIC(dest, src) do {                              \
> +    (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first);  \
> +} while (/*CONSTCOND*/0)

This is atomic for dest but not src.

Maybe the name should make this clear: QSLIST_REPLACE_ATOMIC_DEST().

Please also add a doc comment mentioning that the modification to src is
not atomic.

Stefan
* Re: [External] Re: [PATCH 1/2] QSLIST: add atomic replace operation
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
@ 2020-08-25  3:33   ` 王洪浩
  0 siblings, 0 replies; 11+ messages in thread
From: 王洪浩 @ 2020-08-25  3:33 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, fam, qemu-devel

The semantics of this function are indeed a bit vague. I'll modify it
to bring it more in line with `replace`.

Stefan Hajnoczi <stefanha@redhat.com> wrote on Mon, 24 Aug 2020 at 23:27:
>
> On Mon, Aug 24, 2020 at 12:31:20PM +0800, wanghonghao wrote:
> > Replace a queue with another atomically. This is useful when we need to
> > transfer queues between threads.
> >
> > Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
> > ---
> >  include/qemu/queue.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/qemu/queue.h b/include/qemu/queue.h
> > index 456a5b01ee..a3ff544193 100644
> > --- a/include/qemu/queue.h
> > +++ b/include/qemu/queue.h
> > @@ -226,6 +226,10 @@ struct {                                                \
> >      (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);              \
> >  } while (/*CONSTCOND*/0)
> >
> > +#define QSLIST_REPLACE_ATOMIC(dest, src) do {                              \
> > +    (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first);  \
> > +} while (/*CONSTCOND*/0)
>
> This is atomic for dest but not src.
>
> Maybe the name should make this clear: QSLIST_REPLACE_ATOMIC_DEST().
>
> Please also add a doc comment mentioning that the modification to src is
> not atomic.
>
> Stefan
* [PATCH v2 1/2] QSLIST: add atomic replace operation
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
  2020-08-25  3:33   ` [External] " 王洪浩
@ 2020-08-25  3:37   ` wanghonghao
  2020-08-25  3:37     ` [PATCH v2 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
  1 sibling, 1 reply; 11+ messages in thread
From: wanghonghao @ 2020-08-25  3:37 UTC (permalink / raw)
To: stefanha; +Cc: kwolf, pbonzini, fam, qemu-devel, wanghonghao

Replace a queue with another atomically. This is useful when we need to
transfer queues between threads.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 include/qemu/queue.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 456a5b01ee..62efad2438 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -226,6 +226,10 @@ struct {                                                \
     (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);              \
 } while (/*CONSTCOND*/0)

+#define QSLIST_REPLACE_ATOMIC(dest, src, old) do {                         \
+    (old)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first);  \
+} while (/*CONSTCOND*/0)
+
 #define QSLIST_REMOVE_HEAD(head, field) do {                               \
     typeof((head)->slh_first) elm = (head)->slh_first;                     \
     (head)->slh_first = elm->field.sle_next;                               \
--
2.24.3 (Apple Git-128)
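[Editor's note: a minimal sketch of the v2 three-argument shape, with C11 stdatomic standing in for QEMU's atomic_xchg; the names are illustrative. The displaced chain now goes into a caller-chosen `old` head, and passing the same head as both `src` and `old` (as patch 2/2 does with `&alloc_pool`) swaps the two lists in place.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

struct slist_head {
    struct node *_Atomic first;
};

/* v2 semantics: atomically install src's chain as dest's new contents,
 * and store the displaced chain into old.  Only the exchange on dest is
 * atomic; src and old must be private to the calling thread. */
static void slist_replace_atomic(struct slist_head *dest,
                                 struct slist_head *src,
                                 struct slist_head *old)
{
    old->first = atomic_exchange(&dest->first, src->first);
}
```

With `src == old` the call degenerates to the v1 behavior, so the release path can keep using a single thread-local list for both roles.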
* [PATCH v2 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-25  3:37 ` [PATCH v2 1/2] QSLIST: add atomic replace operation wanghonghao
@ 2020-08-25  3:37   ` wanghonghao
  0 siblings, 0 replies; 11+ messages in thread
From: wanghonghao @ 2020-08-25  3:37 UTC (permalink / raw)
To: stefanha; +Cc: kwolf, pbonzini, fam, qemu-devel, wanghonghao

This patch replaces the global coroutine queue with a lock-free stack
whose elements are coroutine queues. Threads can put coroutine queues
into the stack or take queues from it, and each coroutine queue holds
exactly POOL_BATCH_SIZE coroutines. Note that the stack is not strictly
LIFO, but that is good enough for a buffer pool.

On release, coroutines are put into thread-local pools first. Now the
fast paths of both allocation and release are atomic-free, and not too
many coroutines will remain in a single thread since POOL_BATCH_SIZE
has been reduced to 16.

In practice, I've run a VM with two block devices bound to two
different iothreads, and run fio with iodepth 128 on each device.
Without this patch it maintains around 400 coroutines and has about a
1% chance of calling `qemu_coroutine_new`. With this patch it maintains
no more than 273 coroutines and doesn't call `qemu_coroutine_new` after
the initial allocations.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 util/qemu-coroutine.c | 63 ++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index c3caa6c770..9202ec9c85 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -21,13 +21,14 @@
 #include "block/aio.h"

 enum {
-    POOL_BATCH_SIZE = 64,
+    POOL_BATCH_SIZE = 16,
+    POOL_MAX_BATCHES = 32,
 };

-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
-static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+/** Free stack to speed up creation */
+static QSLIST_HEAD(, Coroutine) pool[POOL_MAX_BATCHES];
+static int pool_top;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool;
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;

@@ -49,20 +50,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            int top;
+
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            }
+
+            while ((top = atomic_read(&pool_top)) > 0) {
+                if (atomic_cmpxchg(&pool_top, top, top - 1) != top) {
+                    continue;
                 }
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &pool[top - 1]);
                 co = QSLIST_FIRST(&alloc_pool);
+
+                if (co) {
+                    alloc_pool_size = POOL_BATCH_SIZE;
+                    break;
+                }
             }
         }
         if (co) {
@@ -86,16 +93,30 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;

     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
+        int top, value, old;
+
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;
             return;
         }
+
+        for (top = atomic_read(&pool_top); top < POOL_MAX_BATCHES; top++) {
+            QSLIST_REPLACE_ATOMIC(&pool[top], &alloc_pool, &alloc_pool);
+            if (!QSLIST_EMPTY(&alloc_pool)) {
+                continue;
+            }
+
+            value = top + 1;
+
+            do {
+                old = atomic_cmpxchg(&pool_top, top, value);
+            } while (old != top && (top = old) < value);
+
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size = 1;
+            return;
+        }
     }

     qemu_coroutine_delete(co);
--
2.24.3 (Apple Git-128)
* [PATCH 1/2] QSLIST: add atomic replace operation
@ 2020-08-13  4:44 wanghonghao
  0 siblings, 0 replies; 11+ messages in thread
From: wanghonghao @ 2020-08-13  4:44 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, wanghonghao, stefanha

Replace a queue with another atomically. This is useful when we need to
transfer queues between threads.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 include/qemu/queue.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 456a5b01ee..a3ff544193 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -226,6 +226,10 @@ struct {                                                \
     (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);              \
 } while (/*CONSTCOND*/0)

+#define QSLIST_REPLACE_ATOMIC(dest, src) do {                              \
+    (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first);  \
+} while (/*CONSTCOND*/0)
+
 #define QSLIST_REMOVE_HEAD(head, field) do {                               \
     typeof((head)->slh_first) elm = (head)->slh_first;                     \
     (head)->slh_first = elm->field.sle_next;                               \
--
2.24.3 (Apple Git-128)