qemu-devel.nongnu.org archive mirror
* [PATCH 1/2] QSLIST: add atomic replace operation
@ 2020-08-24  4:31 wanghonghao
  2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
  0 siblings, 2 replies; 11+ messages in thread
From: wanghonghao @ 2020-08-24  4:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, fam, wanghonghao, stefanha

Atomically replace one queue with another. This is useful when queues need to
be transferred between threads.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 include/qemu/queue.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 456a5b01ee..a3ff544193 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -226,6 +226,10 @@ struct {                                                                \
         (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
 } while (/*CONSTCOND*/0)
 
+#define QSLIST_REPLACE_ATOMIC(dest, src) do {                                 \
+        (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first); \
+} while (/*CONSTCOND*/0)
+
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
         typeof((head)->slh_first) elm = (head)->slh_first;               \
         (head)->slh_first = elm->field.sle_next;                         \
-- 
2.24.3 (Apple Git-128)
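
To illustrate the exchange the new macro performs, here is a minimal
standalone sketch that uses C11 atomics instead of QEMU's atomic_xchg(). The
names node, shared_head, local_head and replace_atomic are hypothetical and
are not part of queue.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical element type; QEMU's QSLIST macros generate similar structs. */
struct node {
    int value;
    struct node *next;
};

/* A head that several threads may touch, so its pointer is atomic. */
struct shared_head {
    _Atomic(struct node *) first;
};

/* A head owned by exactly one thread, accessed with plain loads and stores. */
struct local_head {
    struct node *first;
};

/*
 * Swap the chains held by dest and src in one step.
 * Only the update of dest is atomic; src is read and written non-atomically.
 */
static void replace_atomic(struct shared_head *dest, struct local_head *src)
{
    assert(dest != NULL && src != NULL);
    src->first = atomic_exchange(&dest->first, src->first);
}
```

After replace_atomic(&dest, &src), dest holds the chain that src had and src
holds the chain that dest had, so a thread-local queue can be handed over and
the previous contents picked up in the same operation.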




* [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-24  4:31 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
@ 2020-08-24  4:31 ` wanghonghao
  2020-08-25 14:52   ` Stefan Hajnoczi
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
  1 sibling, 1 reply; 11+ messages in thread
From: wanghonghao @ 2020-08-24  4:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, fam, wanghonghao, stefanha

This patch replaces the global coroutine queue with a lock-free stack whose
elements are coroutine queues. Threads can put coroutine queues onto the
stack or take queues from it, and each coroutine queue holds exactly
POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but that
is good enough for a buffer pool.

On release, coroutines are put into the thread-local pool first. The fast
paths of both allocation and release are now free of atomic operations, and
no single thread accumulates too many coroutines because POOL_BATCH_SIZE has
been reduced to 16.

In practice, I ran a VM with two block devices bound to two different
iothreads, and ran fio with iodepth 128 on each device. Without this patch, it
maintains around 400 coroutines and has about a 1% chance of calling
`qemu_coroutine_new`. With this patch, it maintains no more than 273
coroutines and doesn't call `qemu_coroutine_new` after the initial allocations.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 util/qemu-coroutine.c | 63 ++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index c3caa6c770..070d492edc 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -21,13 +21,14 @@
 #include "block/aio.h"
 
 enum {
-    POOL_BATCH_SIZE = 64,
+    POOL_BATCH_SIZE = 16,
+    POOL_MAX_BATCHES = 32,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
-static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+/** Free stack to speed up creation */
+static QSLIST_HEAD(, Coroutine) pool[POOL_MAX_BATCHES];
+static int pool_top;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool;
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
 
@@ -49,20 +50,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            int top;
+
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            }
+
+            while ((top = atomic_read(&pool_top)) > 0) {
+                if (atomic_cmpxchg(&pool_top, top, top - 1) != top) {
+                    continue;
                 }
 
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &pool[top - 1]);
                 co = QSLIST_FIRST(&alloc_pool);
+
+                if (co) {
+                    alloc_pool_size = POOL_BATCH_SIZE;
+                    break;
+                }
             }
         }
         if (co) {
@@ -86,16 +93,30 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
+        int top, value, old;
+
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;
             return;
         }
+
+        for (top = atomic_read(&pool_top); top < POOL_MAX_BATCHES; top++) {
+            QSLIST_REPLACE_ATOMIC(&pool[top], &alloc_pool);
+            if (!QSLIST_EMPTY(&alloc_pool)) {
+                continue;
+            }
+
+            value = top + 1;
+
+            do {
+                old = atomic_cmpxchg(&pool_top, top, value);
+            } while (old != top && (top = old) < value);
+
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size = 1;
+            return;
+        }
     }
 
     qemu_coroutine_delete(co);
-- 
2.24.3 (Apple Git-128)
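
The slow paths of the patch above can be modeled in isolation. The following
is a simplified sketch under stated assumptions, not the patch itself: batches
are plain pointers instead of QSLIST heads, C11 atomics replace QEMU's atomic
helpers, and the names batch, slots, top_hint, batch_stack_pop and
batch_stack_push are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { MAX_BATCHES = 32 };

struct batch {
    int id;                 /* stand-in for a full queue of coroutines */
};

static _Atomic(struct batch *) slots[MAX_BATCHES];
static atomic_int top_hint; /* slots at or above this index are likely empty */

/* Take one batch from the stack; returns NULL if none could be found. */
static struct batch *batch_stack_pop(void)
{
    int top;

    while ((top = atomic_load(&top_hint)) > 0) {
        /* Lower the hint first; on failure another thread moved it, retry. */
        if (!atomic_compare_exchange_strong(&top_hint, &top, top - 1)) {
            continue;
        }
        /* The slot may already have been emptied by a racing thread. */
        struct batch *b = atomic_exchange(&slots[top - 1], (struct batch *)NULL);
        if (b) {
            return b;
        }
    }
    return NULL;
}

/*
 * Store a batch. Returns NULL on success. If every probed slot was occupied,
 * returns the batch left in the caller's hands, which may be a displaced
 * batch rather than the one passed in.
 */
static struct batch *batch_stack_push(struct batch *b)
{
    assert(b != NULL);

    for (int top = atomic_load(&top_hint); top < MAX_BATCHES; top++) {
        /* Swap our batch into the slot and take whatever was there. */
        b = atomic_exchange(&slots[top], b);
        if (b) {
            continue;       /* slot was occupied: carry that batch upward */
        }

        /* Raise the hint to at least top + 1, never lowering it. */
        int want = top + 1;
        int cur = atomic_load(&top_hint);
        while (cur < want &&
               !atomic_compare_exchange_weak(&top_hint, &cur, want)) {
            /* cur was refreshed by the failed compare-exchange */
        }
        return NULL;
    }
    return b;
}
```

Because a push can displace a batch and carry it to a higher slot, batches do
not necessarily come back in reverse insertion order under contention, which
matches the "not strictly LIFO" remark in the commit message.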




* Re: [PATCH 1/2] QSLIST: add atomic replace operation
  2020-08-24  4:31 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
  2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
@ 2020-08-24 15:26 ` Stefan Hajnoczi
  2020-08-25  3:33   ` [External] " 王洪浩
  2020-08-25  3:37   ` [PATCH v2 " wanghonghao
  1 sibling, 2 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-08-24 15:26 UTC (permalink / raw)
  To: wanghonghao; +Cc: kwolf, pbonzini, fam, qemu-devel


On Mon, Aug 24, 2020 at 12:31:20PM +0800, wanghonghao wrote:
> Atomically replace one queue with another. This is useful when queues need to
> be transferred between threads.
> 
> Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
> ---
>  include/qemu/queue.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/qemu/queue.h b/include/qemu/queue.h
> index 456a5b01ee..a3ff544193 100644
> --- a/include/qemu/queue.h
> +++ b/include/qemu/queue.h
> @@ -226,6 +226,10 @@ struct {                                                                \
>          (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
>  } while (/*CONSTCOND*/0)
>  
> +#define QSLIST_REPLACE_ATOMIC(dest, src) do {                                 \
> +        (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first); \
> +} while (/*CONSTCOND*/0)

This is atomic for dest but not src.

Maybe the name should make this clear: QSLIST_REPLACE_ATOMIC_DEST().

Please also add a doc comment mentioning that the modification to src is
not atomic.
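
For illustration only, a renamed and documented variant might look like the
sketch below. All demo_ and DEMO_ names are hypothetical, and
demo_atomic_xchg() is a stand-in for QEMU's atomic_xchg() built on the
GCC/Clang __atomic_exchange_n builtin, so a compatible compiler is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for QEMU's atomic_xchg(); assumes a GCC- or Clang-compatible compiler. */
#define demo_atomic_xchg(ptr, val) \
    __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST)

/*
 * Exchange the contents of dest and src.
 *
 * The update of dest is atomic, so dest may be shared between threads.
 * The modification to src is NOT atomic, so src must only be accessed by
 * the calling thread.
 */
#define DEMO_QSLIST_REPLACE_ATOMIC_DEST(dest, src) do {                     \
        (src)->slh_first = demo_atomic_xchg(&(dest)->slh_first,             \
                                            (src)->slh_first);              \
} while (/*CONSTCOND*/0)

/* Hypothetical element and head types used only by this sketch. */
struct demo_elem {
    int id;
    struct demo_elem *next;
};

struct demo_head {
    struct demo_elem *slh_first;
};

/* Return the id of the first element, or -1 for an empty list. */
static int demo_head_id(const struct demo_head *h)
{
    assert(h != NULL);
    return h->slh_first ? h->slh_first->id : -1;
}
```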

Stefan


* Re: [External] Re: [PATCH 1/2] QSLIST: add atomic replace operation
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
@ 2020-08-25  3:33   ` 王洪浩
  2020-08-25  3:37   ` [PATCH v2 " wanghonghao
  1 sibling, 0 replies; 11+ messages in thread
From: 王洪浩 @ 2020-08-25  3:33 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, fam, qemu-devel

The semantics of this macro are indeed a bit vague.
I'll modify it to be more in line with `replace`.

On Mon, Aug 24, 2020 at 11:27 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Aug 24, 2020 at 12:31:20PM +0800, wanghonghao wrote:
> > Atomically replace one queue with another. This is useful when queues need to
> > be transferred between threads.
> >
> > Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
> > ---
> >  include/qemu/queue.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/qemu/queue.h b/include/qemu/queue.h
> > index 456a5b01ee..a3ff544193 100644
> > --- a/include/qemu/queue.h
> > +++ b/include/qemu/queue.h
> > @@ -226,6 +226,10 @@ struct {                                                                \
> >          (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
> >  } while (/*CONSTCOND*/0)
> >
> > +#define QSLIST_REPLACE_ATOMIC(dest, src) do {                                 \
> > +        (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first); \
> > +} while (/*CONSTCOND*/0)
>
> This is atomic for dest but not src.
>
> Maybe the name should make this clear: QSLIST_REPLACE_ATOMIC_DEST().
>
> Please also add a doc comment mentioning that the modification to src is
> not atomic.
>
> Stefan



* [PATCH v2 1/2] QSLIST: add atomic replace operation
  2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
  2020-08-25  3:33   ` [External] " 王洪浩
@ 2020-08-25  3:37   ` wanghonghao
  2020-08-25  3:37     ` [PATCH v2 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
  1 sibling, 1 reply; 11+ messages in thread
From: wanghonghao @ 2020-08-25  3:37 UTC (permalink / raw)
  To: stefanha; +Cc: kwolf, pbonzini, fam, qemu-devel, wanghonghao

Atomically replace one queue with another. This is useful when queues need to
be transferred between threads.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 include/qemu/queue.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 456a5b01ee..62efad2438 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -226,6 +226,10 @@ struct {                                                                \
         (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
 } while (/*CONSTCOND*/0)
 
+#define QSLIST_REPLACE_ATOMIC(dest, src, old) do {                            \
+        (old)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first); \
+} while (/*CONSTCOND*/0)
+
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
         typeof((head)->slh_first) elm = (head)->slh_first;               \
         (head)->slh_first = elm->field.sle_next;                         \
-- 
2.24.3 (Apple Git-128)
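
The v2 macro writes the displaced chain to a separate `old` head instead of
back into `src`. A minimal standalone sketch of that behavior with C11 atomics
follows; the names node, shared_head, local_head and replace_atomic3 are
hypothetical and not part of queue.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical element type for this sketch. */
struct node {
    int value;
    struct node *next;
};

/* Head shared between threads: updated atomically. */
struct shared_head {
    _Atomic(struct node *) first;
};

/* Head owned by one thread: plain accesses only. */
struct local_head {
    struct node *first;
};

/*
 * Publish src's chain in dest and store dest's previous chain in old.
 * Only the update of dest is atomic. src is read once before old is
 * written, so passing the same head as src and old behaves as a swap.
 */
static void replace_atomic3(struct shared_head *dest,
                            const struct local_head *src,
                            struct local_head *old)
{
    assert(dest != NULL && src != NULL && old != NULL);
    old->first = atomic_exchange(&dest->first, src->first);
}
```

When `old` and `src` are different heads, `src` still points at the chain that
was just published in `dest`, so the caller should reset or stop using `src`
afterwards to avoid two owners of the same chain.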




* [PATCH v2 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-25  3:37   ` [PATCH v2 " wanghonghao
@ 2020-08-25  3:37     ` wanghonghao
  0 siblings, 0 replies; 11+ messages in thread
From: wanghonghao @ 2020-08-25  3:37 UTC (permalink / raw)
  To: stefanha; +Cc: kwolf, pbonzini, fam, qemu-devel, wanghonghao

This patch replaces the global coroutine queue with a lock-free stack whose
elements are coroutine queues. Threads can put coroutine queues onto the
stack or take queues from it, and each coroutine queue holds exactly
POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but that
is good enough for a buffer pool.

On release, coroutines are put into the thread-local pool first. The fast
paths of both allocation and release are now free of atomic operations, and
no single thread accumulates too many coroutines because POOL_BATCH_SIZE has
been reduced to 16.

In practice, I ran a VM with two block devices bound to two different
iothreads, and ran fio with iodepth 128 on each device. Without this patch, it
maintains around 400 coroutines and has about a 1% chance of calling
`qemu_coroutine_new`. With this patch, it maintains no more than 273
coroutines and doesn't call `qemu_coroutine_new` after the initial allocations.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 util/qemu-coroutine.c | 63 ++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index c3caa6c770..9202ec9c85 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -21,13 +21,14 @@
 #include "block/aio.h"
 
 enum {
-    POOL_BATCH_SIZE = 64,
+    POOL_BATCH_SIZE = 16,
+    POOL_MAX_BATCHES = 32,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
-static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+/** Free stack to speed up creation */
+static QSLIST_HEAD(, Coroutine) pool[POOL_MAX_BATCHES];
+static int pool_top;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool;
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
 
@@ -49,20 +50,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            int top;
+
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            }
+
+            while ((top = atomic_read(&pool_top)) > 0) {
+                if (atomic_cmpxchg(&pool_top, top, top - 1) != top) {
+                    continue;
                 }
 
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &pool[top - 1]);
                 co = QSLIST_FIRST(&alloc_pool);
+
+                if (co) {
+                    alloc_pool_size = POOL_BATCH_SIZE;
+                    break;
+                }
             }
         }
         if (co) {
@@ -86,16 +93,30 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
+        int top, value, old;
+
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;
             return;
         }
+
+        for (top = atomic_read(&pool_top); top < POOL_MAX_BATCHES; top++) {
+            QSLIST_REPLACE_ATOMIC(&pool[top], &alloc_pool, &alloc_pool);
+            if (!QSLIST_EMPTY(&alloc_pool)) {
+                continue;
+            }
+
+            value = top + 1;
+
+            do {
+                old = atomic_cmpxchg(&pool_top, top, value);
+            } while (old != top && (top = old) < value);
+
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size = 1;
+            return;
+        }
     }
 
     qemu_coroutine_delete(co);
-- 
2.24.3 (Apple Git-128)




* Re: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
@ 2020-08-25 14:52   ` Stefan Hajnoczi
  2020-08-26  6:06     ` [External] " 王洪浩
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-08-25 14:52 UTC (permalink / raw)
  To: wanghonghao; +Cc: kwolf, pbonzini, fam, qemu-devel


On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> This patch replaces the global coroutine queue with a lock-free stack whose
> elements are coroutine queues. Threads can put coroutine queues onto the
> stack or take queues from it, and each coroutine queue holds exactly
> POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but that
> is good enough for a buffer pool.
> 
> On release, coroutines are put into the thread-local pool first. The fast
> paths of both allocation and release are now free of atomic operations, and
> no single thread accumulates too many coroutines because POOL_BATCH_SIZE has
> been reduced to 16.
> 
> In practice, I ran a VM with two block devices bound to two different
> iothreads, and ran fio with iodepth 128 on each device. Without this patch, it
> maintains around 400 coroutines and has about a 1% chance of calling
> `qemu_coroutine_new`. With this patch, it maintains no more than 273
> coroutines and doesn't call `qemu_coroutine_new` after the initial allocations.

Does throughput or IOPS change?

Is the main purpose of this patch to reduce memory consumption?

Stefan


* Re: [External] Re: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-25 14:52   ` Stefan Hajnoczi
@ 2020-08-26  6:06     ` 王洪浩
  2020-09-29  3:24       ` PING: " 王洪浩
  0 siblings, 1 reply; 11+ messages in thread
From: 王洪浩 @ 2020-08-26  6:06 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, fam, qemu-devel

The purpose of this patch is to improve performance without increasing
memory consumption.

My test case:
QEMU command-line arguments:
-drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \
    -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \
-drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \
    -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \

Run these two fio jobs at the same time:
[job-vda]
filename=/dev/vda
iodepth=64
ioengine=libaio
rw=randrw
bs=4k
size=300G
rwmixread=80
direct=1
numjobs=2
runtime=60

[job-vdb]
filename=/dev/vdb
iodepth=64
ioengine=libaio
rw=randrw
bs=4k
size=300G
rwmixread=90
direct=1
numjobs=2
loops=1
runtime=60

Without this patch, 3 runs:
total iops: 278548.1, 312374.1, 276638.2
With this patch, 3 runs:
total iops: 368370.9, 335693.2, 327693.1

An 18.9% improvement on average.
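
The average figure can be reproduced from the six totals above. A small
sketch, where the helper names mean and improvement_percent are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Total IOPS of the three runs in each configuration, as reported above. */
static const double iops_without_patch[] = { 278548.1, 312374.1, 276638.2 };
static const double iops_with_patch[]    = { 368370.9, 335693.2, 327693.1 };

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    assert(v != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Relative change of the mean of `after` over the mean of `before`, in percent. */
static double improvement_percent(const double *before, const double *after,
                                  size_t n)
{
    return (mean(after, n) / mean(before, n) - 1.0) * 100.0;
}
```

The means are about 289186.8 and 343919.1 IOPS, so
improvement_percent(iops_without_patch, iops_with_patch, 3) comes out at
roughly 18.9.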

In addition, we are also using distributed block storage, whose I/O latency
is much higher than that of local NVMe devices because of network overhead.
It therefore needs a higher iodepth (>=256) to reach its maximum throughput.
Without this patch, there is a more than 5% chance of calling
`qemu_coroutine_new` and the IOPS is less than 100K, while the IOPS is
about 260K with this patch.

On the other hand, a simpler way to reduce or eliminate the cost of
`qemu_coroutine_new` is to increase POOL_BATCH_SIZE, but that would also
bring much higher memory consumption, which we don't want. Avoiding that
trade-off is the purpose of this patch.

On Tue, Aug 25, 2020 at 10:52 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> > This patch replaces the global coroutine queue with a lock-free stack whose
> > elements are coroutine queues. Threads can put coroutine queues onto the
> > stack or take queues from it, and each coroutine queue holds exactly
> > POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but that
> > is good enough for a buffer pool.
> >
> > On release, coroutines are put into the thread-local pool first. The fast
> > paths of both allocation and release are now free of atomic operations, and
> > no single thread accumulates too many coroutines because POOL_BATCH_SIZE has
> > been reduced to 16.
> >
> > In practice, I ran a VM with two block devices bound to two different
> > iothreads, and ran fio with iodepth 128 on each device. Without this patch, it
> > maintains around 400 coroutines and has about a 1% chance of calling
> > `qemu_coroutine_new`. With this patch, it maintains no more than 273
> > coroutines and doesn't call `qemu_coroutine_new` after the initial allocations.
>
> Does throughput or IOPS change?
>
> Is the main purpose of this patch to reduce memory consumption?
>
> Stefan



* PING: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-26  6:06     ` [External] " 王洪浩
@ 2020-09-29  3:24       ` 王洪浩
  2020-10-13 10:04         ` Stefan Hajnoczi
  0 siblings, 1 reply; 11+ messages in thread
From: 王洪浩 @ 2020-09-29  3:24 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, fam, qemu-devel

Hi, I'd like to know whether there are any other problems with this patch,
or whether there is a better way to improve the coroutine pool.




* Re: PING: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-09-29  3:24       ` PING: " 王洪浩
@ 2020-10-13 10:04         ` Stefan Hajnoczi
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2020-10-13 10:04 UTC (permalink / raw)
  To: 王洪浩; +Cc: kwolf, pbonzini, fam, qemu-devel


On Tue, Sep 29, 2020 at 11:24:14AM +0800, 王洪浩 wrote:
> Hi, I'd like to know whether there are any other problems with this patch,
> or whether there is a better way to improve the coroutine pool.

Please rebase onto qemu.git/master and resend the patch as a top-level
email thread. I think v2 was overlooked because it was sent as a reply:
https://wiki.qemu.org/Contribute/SubmitAPatch

Thanks,
Stefan


* [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
  2020-08-13  4:44 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
@ 2020-08-13  4:44 ` wanghonghao
  0 siblings, 0 replies; 11+ messages in thread
From: wanghonghao @ 2020-08-13  4:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, wanghonghao, stefanha

This patch replaces the global coroutine queue with a lock-free stack whose
elements are coroutine queues. Threads can put coroutine queues onto the
stack or take queues from it, and each coroutine queue holds exactly
POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but that
is good enough for a buffer pool.

On release, coroutines are put into the thread-local pool first. The fast
paths of both allocation and release are now free of atomic operations, and
no single thread accumulates too many coroutines because POOL_BATCH_SIZE has
been reduced to 16.

In practice, I ran a VM with two block devices bound to two different
iothreads, and ran fio with iodepth 128 on each device. Without this patch, it
maintains around 400 coroutines and has about a 1% chance of calling
`qemu_coroutine_new`. With this patch, it maintains no more than 273
coroutines and doesn't call `qemu_coroutine_new` after the initial allocations.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 util/qemu-coroutine.c | 63 ++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index c3caa6c770..02cd68bc4a 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -21,13 +21,14 @@
 #include "block/aio.h"
 
 enum {
-    POOL_BATCH_SIZE = 64,
+    POOL_BATCH_SIZE = 16,
+    POOL_MAX_BATCHES = 32,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
-static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+/** Free stack to speed up creation */
+static QSLIST_HEAD(, Coroutine) pool[POOL_MAX_BATCHES];
+static int pool_top;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool;
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
 
@@ -49,20 +50,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            int top;
+
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                 coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                 qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+            }
+
+            while ((top = atomic_read(&pool_top)) > 0) {
+                if (atomic_cmpxchg(&pool_top, top, top - 1) != top) {
+                    continue;
                 }
 
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &pool[top - 1]);
                 co = QSLIST_FIRST(&alloc_pool);
+
+                if (co) {
+                    alloc_pool_size = POOL_BATCH_SIZE;
+                    break;
+                }
             }
         }
         if (co) {
@@ -86,16 +93,30 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
+        int top, value, old;
+
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;
             return;
         }
+
+        for (top = atomic_read(&pool_top); top < POOL_MAX_BATCHES; top++) {
+            QSLIST_REPLACE_ATOMIC(&pool[top], &alloc_pool);
+            if (!QSLIST_EMPTY(&alloc_pool)) {
+                continue;
+            }
+
+            value = top + 1;
+
+            do {
+                old = atomic_cmpxchg(&pool_top, top, value);
+            } while (old != top && (top = old) < value);
+
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size = 1;
+            return;
+        }
     }
 
     qemu_coroutine_delete(co);
-- 
2.24.3 (Apple Git-128)




Thread overview: 11+ messages
2020-08-24  4:31 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
2020-08-24  4:31 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
2020-08-25 14:52   ` Stefan Hajnoczi
2020-08-26  6:06     ` [External] " 王洪浩
2020-09-29  3:24       ` PING: " 王洪浩
2020-10-13 10:04         ` Stefan Hajnoczi
2020-08-24 15:26 ` [PATCH 1/2] QSLIST: add atomic replace operation Stefan Hajnoczi
2020-08-25  3:33   ` [External] " 王洪浩
2020-08-25  3:37   ` [PATCH v2 " wanghonghao
2020-08-25  3:37     ` [PATCH v2 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
  -- strict thread matches above, loose matches on Subject: below --
2020-08-13  4:44 [PATCH 1/2] QSLIST: add atomic replace operation wanghonghao
2020-08-13  4:44 ` [PATCH 2/2] coroutine: take exactly one batch from global pool at a time wanghonghao
