* [Qemu-devel] [PATCH 0/7] coroutine: optimizations
@ 2014-11-28 14:12 Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread Paolo Bonzini
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

As discussed in the other thread, this brings speedups from
dropping the coroutine mutex (which serializes multiple iothreads,
too) and using ELF thread-local storage.

The speedup in perf/cost is about 30% (190 -> 145 ns per iteration).  Windows port tested
with tests/test-coroutine.exe under Wine.

Paolo

Paolo Bonzini (7):
  coroutine-ucontext: use __thread
  qemu-thread: add per-thread atexit functions
  test-coroutine: avoid overflow on 32-bit systems
  QSLIST: add lock-free operations
  coroutine: rewrite pool to avoid mutex
  coroutine: drop qemu_coroutine_adjust_pool_size
  coroutine: try harder not to delete coroutines

 block/block-backend.c     |   4 --
 coroutine-ucontext.c      |  64 +++++++---------------------
 include/block/coroutine.h |  10 -----
 include/qemu/queue.h      |  15 ++++++-
 include/qemu/thread.h     |   4 ++
 qemu-coroutine.c          | 104 ++++++++++++++++++++++------------------------
 tests/test-coroutine.c    |   2 +-
 util/qemu-thread-posix.c  |  37 +++++++++++++++++
 util/qemu-thread-win32.c  |  48 ++++++++++++++++-----
 9 files changed, 157 insertions(+), 131 deletions(-)

-- 
2.1.0


* [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-11-28 14:28   ` Peter Maydell
  2014-11-28 14:45   ` Markus Armbruster
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 2/7] qemu-thread: add per-thread atexit functions Paolo Bonzini
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

ELF thread local storage is about 10% faster on tests/test-coroutine's
perf/cost test.  The timing on my machine is 160ns per iteration with
pthread TLS, 145 with ELF TLS.

Based on a patch by Kevin Wolf and Peter Lieven, but redone to follow
the model of coroutine-win32.c (including the important "noinline"
attribute!!!).

Platforms without thread-local storage (OpenBSD probably?) will need
a new-enough GCC for this to compile, in order to use the same emutls
support that Windows already relies on.
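
A minimal sketch of the lazily initialized __thread pattern described above.
It is not the QEMU code: Coro, coro_self and in_coro are hypothetical
stand-ins for Coroutine, qemu_coroutine_self and qemu_in_coroutine, and the
placement of the GCC noinline attribute is only illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's Coroutine; only the field used here. */
typedef struct Coro {
    struct Coro *caller;   /* non-NULL while running inside a coroutine */
} Coro;

/* One zero-initialized copy per thread: no key creation, no malloc. */
static __thread Coro leader;
static __thread Coro *current;

/* noinline (GCC/Clang syntax) keeps the TLS lookup inside this function,
 * so a caller cannot cache the address across a context switch. */
static __attribute__((noinline)) Coro *coro_self(void)
{
    if (!current) {
        current = &leader;   /* first call on this thread */
    }
    return current;
}

static bool in_coro(void)
{
    return current && current->caller;
}
```

With pthread keys the same lookup costs a pthread_getspecific() call plus a
heap allocation on first use; here it is a plain TLS load.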

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 coroutine-ucontext.c | 64 +++++++++++++---------------------------------------
 1 file changed, 16 insertions(+), 48 deletions(-)

diff --git a/coroutine-ucontext.c b/coroutine-ucontext.c
index 4bf2cde..d86e3e1 100644
--- a/coroutine-ucontext.c
+++ b/coroutine-ucontext.c
@@ -25,7 +25,6 @@
 #include <stdlib.h>
 #include <setjmp.h>
 #include <stdint.h>
-#include <pthread.h>
 #include <ucontext.h>
 #include "qemu-common.h"
 #include "block/coroutine_int.h"
@@ -48,15 +47,8 @@ typedef struct {
 /**
  * Per-thread coroutine bookkeeping
  */
-typedef struct {
-    /** Currently executing coroutine */
-    Coroutine *current;
-
-    /** The default coroutine */
-    CoroutineUContext leader;
-} CoroutineThreadState;
-
-static pthread_key_t thread_state_key;
+static __thread CoroutineUContext leader;
+static __thread Coroutine *current;
 
 /*
  * va_args to makecontext() must be type 'int', so passing
@@ -68,36 +60,6 @@ union cc_arg {
     int i[2];
 };
 
-static CoroutineThreadState *coroutine_get_thread_state(void)
-{
-    CoroutineThreadState *s = pthread_getspecific(thread_state_key);
-
-    if (!s) {
-        s = g_malloc0(sizeof(*s));
-        s->current = &s->leader.base;
-        pthread_setspecific(thread_state_key, s);
-    }
-    return s;
-}
-
-static void qemu_coroutine_thread_cleanup(void *opaque)
-{
-    CoroutineThreadState *s = opaque;
-
-    g_free(s);
-}
-
-static void __attribute__((constructor)) coroutine_init(void)
-{
-    int ret;
-
-    ret = pthread_key_create(&thread_state_key, qemu_coroutine_thread_cleanup);
-    if (ret != 0) {
-        fprintf(stderr, "unable to create leader key: %s\n", strerror(errno));
-        abort();
-    }
-}
-
 static void coroutine_trampoline(int i0, int i1)
 {
     union cc_arg arg;
@@ -193,15 +155,22 @@ void qemu_coroutine_delete(Coroutine *co_)
     g_free(co);
 }
 
+/* This function is marked noinline to prevent GCC from inlining it
+ * into coroutine_trampoline(). If we allow it to do that then it
+ * hoists the code to get the address of the TLS variable "current"
+ * out of the while() loop. This is an invalid transformation because
+ * the SwitchToFiber() call may be called when running thread A but
+ * return in thread B, and so we might be in a different thread
+ * context each time round the loop.
+ */
 CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
                                       CoroutineAction action)
 {
     CoroutineUContext *from = DO_UPCAST(CoroutineUContext, base, from_);
     CoroutineUContext *to = DO_UPCAST(CoroutineUContext, base, to_);
-    CoroutineThreadState *s = coroutine_get_thread_state();
     int ret;
 
-    s->current = to_;
+    current = to_;
 
     ret = sigsetjmp(from->env, 0);
     if (ret == 0) {
@@ -212,14 +181,13 @@ CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
 
 Coroutine *qemu_coroutine_self(void)
 {
-    CoroutineThreadState *s = coroutine_get_thread_state();
-
-    return s->current;
+    if (!current) {
+        current = &leader.base;
+    }
+    return current;
 }
 
 bool qemu_in_coroutine(void)
 {
-    CoroutineThreadState *s = pthread_getspecific(thread_state_key);
-
-    return s && s->current->caller;
+    return current && current->caller;
 }
-- 
2.1.0


* [Qemu-devel] [PATCH 2/7] qemu-thread: add per-thread atexit functions
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems Paolo Bonzini
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

Destructors are the main additional feature of pthread TLS compared
to __thread.  If we were using C++ (hint, hint!) we could have used
thread-local objects with a destructor.  Since we are not, we add a
simple Notifier-based API instead.

Note that the notifier must be per-thread as well.  We can perhaps add
a global list later.

The Win32 implementation has some complications because a) detached
threads used not to have a QemuThreadData; b) the main thread does
not go through win32_start_routine, so we have to use atexit too.
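
A rough sketch of the idea on POSIX, assuming a simplified hook list rather
than QEMU's Notifier/NotifierList (ExitHook, thread_atexit_add, count_hook
and worker are hypothetical names).  The pthread key stores the head of the
calling thread's list, and the key destructor walks it when the thread exits.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-thread exit hook; QEMU uses Notifier instead. */
typedef struct ExitHook {
    void (*fn)(struct ExitHook *self);
    struct ExitHook *next;
} ExitHook;

static pthread_key_t exit_key;
static pthread_once_t exit_once = PTHREAD_ONCE_INIT;

/* Key destructor: runs in the exiting thread with that thread's list head. */
static void run_exit_hooks(void *arg)
{
    for (ExitHook *h = arg, *next; h; h = next) {
        next = h->next;
        h->fn(h);
    }
}

static void make_key(void)
{
    pthread_key_create(&exit_key, run_exit_hooks);
}

/* Prepend a hook to the calling thread's private list. */
static void thread_atexit_add(ExitHook *h)
{
    pthread_once(&exit_once, make_key);
    h->next = pthread_getspecific(exit_key);
    pthread_setspecific(exit_key, h);
}

/* Demo helpers: a hook that counts calls, and a thread body that
 * registers the hook it is handed. */
static int hooks_run;
static void count_hook(ExitHook *self) { (void)self; hooks_run++; }
static void *worker(void *arg) { thread_atexit_add(arg); return NULL; }
```

POSIX key destructors do not run when the process leaves through exit(), so
a main thread would still need a separate atexit() hook in this sketch too.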

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/thread.h    |  4 ++++
 util/qemu-thread-posix.c | 37 +++++++++++++++++++++++++++++++++++++
 util/qemu-thread-win32.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f7e3b9b..e89fdc9 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -61,4 +61,8 @@ bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 void qemu_thread_naming(bool enable);
 
+struct Notifier;
+void qemu_thread_atexit_add(struct Notifier *notifier);
+void qemu_thread_atexit_remove(struct Notifier *notifier);
+
 #endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index d05a649..41cb23d 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -26,6 +26,7 @@
 #endif
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
+#include "qemu/notify.h"
 
 static bool name_threads;
 
@@ -401,6 +402,42 @@ void qemu_event_wait(QemuEvent *ev)
     }
 }
 
+static pthread_key_t exit_key;
+
+union NotifierThreadData {
+    void *ptr;
+    NotifierList list;
+};
+QEMU_BUILD_BUG_ON(sizeof(union NotifierThreadData) != sizeof(void *));
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_list_add(&ntd.list, notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_remove(notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+static void qemu_thread_atexit_run(void *arg)
+{
+    union NotifierThreadData ntd = { .ptr = arg };
+    notifier_list_notify(&ntd.list, NULL);
+}
+
+static void __attribute__((constructor)) qemu_thread_atexit_init(void)
+{
+    pthread_key_create(&exit_key, qemu_thread_atexit_run);
+}
+
+
 /* Attempt to set the threads name; note that this is for debug, so
  * we're not going to fail if we can't set it.
  */
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index c405c9b..7bda85b 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -12,6 +12,7 @@
  */
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/notify.h"
 #include <process.h>
 #include <assert.h>
 #include <limits.h>
@@ -268,6 +269,7 @@ struct QemuThreadData {
     void             *(*start_routine)(void *);
     void             *arg;
     short             mode;
+    NotifierList      exit;
 
     /* Only used for joinable threads. */
     bool              exited;
@@ -275,18 +277,40 @@ struct QemuThreadData {
     CRITICAL_SECTION  cs;
 };
 
+static bool atexit_registered;
+static NotifierList main_thread_exit;
+
 static __thread QemuThreadData *qemu_thread_data;
 
+static void run_main_iothread_exit(void)
+{
+    notifier_list_notify(&main_thread_exit, NULL);
+}
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    if (!qemu_thread_data) {
+        if (!atexit_registered) {
+            atexit_registered = true;
+            atexit(run_main_iothread_exit);
+        }
+        notifier_list_add(&main_thread_exit, notifier);
+    } else {
+        notifier_list_add(&qemu_thread_data->exit, notifier);
+    }
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    notifier_remove(notifier);
+}
+
 static unsigned __stdcall win32_start_routine(void *arg)
 {
     QemuThreadData *data = (QemuThreadData *) arg;
     void *(*start_routine)(void *) = data->start_routine;
     void *thread_arg = data->arg;
 
-    if (data->mode == QEMU_THREAD_DETACHED) {
-        g_free(data);
-        data = NULL;
-    }
     qemu_thread_data = data;
     qemu_thread_exit(start_routine(thread_arg));
     abort();
@@ -296,12 +320,14 @@ void qemu_thread_exit(void *arg)
 {
     QemuThreadData *data = qemu_thread_data;
 
-    if (data) {
-        assert(data->mode != QEMU_THREAD_DETACHED);
+    notifier_list_notify(&data->exit, NULL);
+    if (data->mode == QEMU_THREAD_JOINABLE) {
         data->ret = arg;
         EnterCriticalSection(&data->cs);
         data->exited = true;
         LeaveCriticalSection(&data->cs);
+    } else {
+        g_free(data);
     }
     _endthreadex(0);
 }
@@ -313,9 +339,10 @@ void *qemu_thread_join(QemuThread *thread)
     HANDLE handle;
 
     data = thread->data;
-    if (!data) {
+    if (data->mode == QEMU_THREAD_DETACHED) {
         return NULL;
     }
+
     /*
      * Because multiple copies of the QemuThread can exist via
      * qemu_thread_get_self, we need to store a value that cannot
@@ -329,7 +356,6 @@ void *qemu_thread_join(QemuThread *thread)
         CloseHandle(handle);
     }
     ret = data->ret;
-    assert(data->mode != QEMU_THREAD_DETACHED);
     DeleteCriticalSection(&data->cs);
     g_free(data);
     return ret;
@@ -347,6 +373,7 @@ void qemu_thread_create(QemuThread *thread, const char *name,
     data->arg = arg;
     data->mode = mode;
     data->exited = false;
+    notifier_list_init(&data->exit);
 
     if (data->mode != QEMU_THREAD_DETACHED) {
         InitializeCriticalSection(&data->cs);
@@ -358,7 +385,7 @@ void qemu_thread_create(QemuThread *thread, const char *name,
         error_exit(GetLastError(), __func__);
     }
     CloseHandle(hThread);
-    thread->data = (mode == QEMU_THREAD_DETACHED) ? NULL : data;
+    thread->data = data;
 }
 
 void qemu_thread_get_self(QemuThread *thread)
@@ -373,11 +400,10 @@ HANDLE qemu_thread_get_handle(QemuThread *thread)
     HANDLE handle;
 
     data = thread->data;
-    if (!data) {
+    if (data->mode == QEMU_THREAD_DETACHED) {
         return NULL;
     }
 
-    assert(data->mode != QEMU_THREAD_DETACHED);
     EnterCriticalSection(&data->cs);
     if (!data->exited) {
         handle = OpenThread(SYNCHRONIZE | THREAD_SUSPEND_RESUME, FALSE,
-- 
2.1.0


* [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 2/7] qemu-thread: add per-thread atexit functions Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-12-01  1:28   ` Ming Lei
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 4/7] QSLIST: add lock-free operations Paolo Bonzini
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

unsigned long is not large enough to represent 1000000000 * duration there.
Just use floating point.
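
To illustrate the arithmetic (a sketch, not the test code): uint32_t models a
32-bit unsigned long, and ns_per_op is a hypothetical helper.  Once the run
lasts longer than about 4.29 seconds, 1000000000 * duration no longer fits in
32 bits, so the division has to happen before the conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds per operation.  The whole expression is evaluated in
 * double; only the small final quotient is narrowed to 32 bits. */
static uint32_t ns_per_op(double duration_s, unsigned int ops)
{
    return (uint32_t)(1000000000.0 * duration_s / ops);
}
```

For a 5 second run of 100000000 operations the intermediate product is 5e9,
which exceeds UINT32_MAX, while the quotient is only 50.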

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tests/test-coroutine.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index e22fae1..27d1b6f 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -337,7 +337,7 @@ static void perf_cost(void)
                    "%luns per coroutine",
                    maxcycles,
                    duration, ops,
-                   (unsigned long)(1000000000 * duration) / maxcycles);
+                   (unsigned long)(1000000000.0 * duration / maxcycles));
 }
 
 int main(int argc, char **argv)
-- 
2.1.0


* [Qemu-devel] [PATCH 4/7] QSLIST: add lock-free operations
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
                   ` (2 preceding siblings ...)
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

These operations are trivial to implement and do not have ABA problems.
They are enough to implement simple multiple-producer, single-consumer
lock-free lists or, as in the next patch, to let multiple consumers each
steal a whole batch of elements and process it at their leisure.
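
The two operations can be sketched with C11 atomics as follows.  This is an
illustration only: the patch uses QEMU's atomic_cmpxchg/atomic_xchg macros,
and Node, AtomicList, list_push and list_take_all are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    int value;
} Node;

typedef struct {
    _Atomic(Node *) first;
} AtomicList;

/* Push: link against the observed head, retry if the head moved.
 * On failure the compare-exchange reloads 'old' for the next attempt. */
static void list_push(AtomicList *l, Node *n)
{
    Node *old = atomic_load(&l->first);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&l->first, &old, n));
}

/* Detach the whole chain with one exchange; the caller then owns it
 * privately and can walk it without further atomics. */
static Node *list_take_all(AtomicList *l)
{
    return atomic_exchange(&l->first, (Node *)NULL);
}
```

Push never dereferences the old head and the consumer never pops a single
element, which is why a recycled head pointer cannot corrupt the list.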

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/queue.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,8 +191,19 @@ struct {                                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_INSERT_HEAD(head, elm, field) do {                        \
-        (elm)->field.sle_next = (head)->slh_first;                      \
-        (head)->slh_first = (elm);                                      \
+        (elm)->field.sle_next = (head)->slh_first;                       \
+        (head)->slh_first = (elm);                                       \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                   \
+        do {                                                               \
+            (elm)->field.sle_next = (head)->slh_first;                     \
+        } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next, \
+                               (elm)) != (elm)->field.sle_next);           \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {                               \
+        (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
-- 
2.1.0


* [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
                   ` (3 preceding siblings ...)
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 4/7] QSLIST: add lock-free operations Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-11-28 16:40   ` Kevin Wolf
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 6/7] coroutine: drop qemu_coroutine_adjust_pool_size Paolo Bonzini
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-thread lists that are actually used on coroutine creation.

Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:

1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.

2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also removes qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio file descriptors).
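
The scheme in points 1) and 2) can be sketched as below, using C11 atomics
and plain malloc/free in place of coroutine creation.  Obj, obj_get, obj_put
and the small BATCH value are assumptions made for illustration; the patch
uses POOL_BATCH_SIZE = 64 and QEMU's own atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH 4u   /* stands in for POOL_BATCH_SIZE */

typedef struct Obj { struct Obj *next; } Obj;

static _Atomic(Obj *) release_pool;     /* shared, lock-free */
static atomic_uint release_pool_size;   /* heuristic counter, may skew */
static __thread Obj *alloc_pool;        /* private to each thread */

static Obj *obj_get(void)
{
    if (!alloc_pool && atomic_load(&release_pool_size) > BATCH) {
        /* Slow path: take the entire shared list with one exchange. */
        atomic_store(&release_pool_size, 0);
        alloc_pool = atomic_exchange(&release_pool, (Obj *)NULL);
    }
    if (alloc_pool) {
        Obj *o = alloc_pool;            /* fast path: no atomics */
        alloc_pool = o->next;
        return o;
    }
    return malloc(sizeof(Obj));         /* nothing pooled: allocate */
}

static void obj_put(Obj *o)
{
    if (atomic_load(&release_pool_size) < BATCH * 2) {
        Obj *old = atomic_load(&release_pool);
        do {
            o->next = old;
        } while (!atomic_compare_exchange_weak(&release_pool, &old, o));
        atomic_fetch_add(&release_pool_size, 1);
        return;
    }
    free(o);                            /* shared pool is full */
}
```

Only the thread that runs dry pays for an atomic exchange; every other
allocation touches thread-local memory alone.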

This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.

I still believe we will end with some kind of coroutine bypass scheme
(even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv
directly can help), but hey it cannot hurt to optimize hot code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 qemu-coroutine.c | 93 +++++++++++++++++++++++++-------------------------------
 1 file changed, 42 insertions(+), 51 deletions(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..aee1017 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,31 +15,57 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-    POOL_DEFAULT_SIZE = 64,
+    POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
+static unsigned int release_pool_size;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread Notifier coroutine_pool_cleanup_notifier;
+
+static void coroutine_pool_cleanup(Notifier *n, void *value)
+{
+    Coroutine *co;
+    Coroutine *tmp;
+
+    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+        qemu_coroutine_delete(co);
+    }
+}
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
+        co = QSLIST_FIRST(&alloc_pool);
+        if (!co) {
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* Slow path; a good place to register the destructor, too.  */
+                if (!coroutine_pool_cleanup_notifier.notify) {
+                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+                }
+
+                /* This is not exact; there could be a little skew between
+                 * release_pool_size and the actual size of release_pool.  But
+                 * it is just a heuristic, it does not need to be perfect.
+                 */
+                release_pool_size = 0;
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                co = QSLIST_FIRST(&alloc_pool);
+            }
+        }
         if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
@@ -53,39 +80,19 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 
 static void coroutine_delete(Coroutine *co)
 {
+    co->caller = NULL;
+
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < pool_max_size) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
-            co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
+            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
+            atomic_inc(&release_pool_size);
             return;
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     qemu_coroutine_delete(co);
 }
 
-static void __attribute__((constructor)) coroutine_pool_init(void)
-{
-    qemu_mutex_init(&pool_lock);
-}
-
-static void __attribute__((destructor)) coroutine_pool_cleanup(void)
-{
-    Coroutine *co;
-    Coroutine *tmp;
-
-    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        qemu_coroutine_delete(co);
-    }
-
-    qemu_mutex_destroy(&pool_lock);
-}
-
 static void coroutine_swap(Coroutine *from, Coroutine *to)
 {
     CoroutineAction ret;
@@ -140,20 +147,4 @@ void coroutine_fn qemu_coroutine_yield(void)
 
 void qemu_coroutine_adjust_pool_size(int n)
 {
-    qemu_mutex_lock(&pool_lock);
-
-    pool_max_size += n;
-
-    /* Callers should never take away more than they added */
-    assert(pool_max_size >= POOL_DEFAULT_SIZE);
-
-    /* Trim oversized pool down to new max */
-    while (pool_size > pool_max_size) {
-        Coroutine *co = QSLIST_FIRST(&pool);
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        pool_size--;
-        qemu_coroutine_delete(co);
-    }
-
-    qemu_mutex_unlock(&pool_lock);
 }
-- 
2.1.0


* [Qemu-devel] [PATCH 6/7] coroutine: drop qemu_coroutine_adjust_pool_size
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
                   ` (4 preceding siblings ...)
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines Paolo Bonzini
  2014-12-01  5:55 ` [Qemu-devel] [PATCH 0/7] coroutine: optimizations Ming Lei
  7 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

This is not needed anymore.  The new TLS-based algorithm is adaptive.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/block-backend.c     |  4 ----
 include/block/coroutine.h | 10 ----------
 qemu-coroutine.c          |  4 ----
 3 files changed, 18 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d0692b1..abf0cd1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -260,9 +260,6 @@ int blk_attach_dev(BlockBackend *blk, void *dev)
     blk_ref(blk);
     blk->dev = dev;
     bdrv_iostatus_reset(blk->bs);
-
-    /* We're expecting I/O from the device so bump up coroutine pool size */
-    qemu_coroutine_adjust_pool_size(COROUTINE_POOL_RESERVATION);
     return 0;
 }
 
@@ -290,7 +287,6 @@ void blk_detach_dev(BlockBackend *blk, void *dev)
     blk->dev_ops = NULL;
     blk->dev_opaque = NULL;
     bdrv_set_guest_block_size(blk->bs, 512);
-    qemu_coroutine_adjust_pool_size(-COROUTINE_POOL_RESERVATION);
     blk_unref(blk);
 }
 
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 793df0e..20c027a 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -216,14 +216,4 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  */
 void coroutine_fn yield_until_fd_readable(int fd);
 
-/**
- * Add or subtract from the coroutine pool size
- *
- * The coroutine implementation keeps a pool of coroutines to be reused by
- * qemu_coroutine_create().  This makes coroutine creation cheap.  Heavy
- * coroutine users should call this to reserve pool space.  Call it again with
- * a negative number to release pool space.
- */
-void qemu_coroutine_adjust_pool_size(int n);
-
 #endif /* QEMU_COROUTINE_H */
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index aee1017..ca40f4f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -144,7 +144,3 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
-
-void qemu_coroutine_adjust_pool_size(int n)
-{
-}
-- 
2.1.0


* [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
                   ` (5 preceding siblings ...)
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 6/7] coroutine: drop qemu_coroutine_adjust_pool_size Paolo Bonzini
@ 2014-11-28 14:12 ` Paolo Bonzini
  2014-11-28 20:52   ` Peter Lieven
  2014-12-01  5:55 ` [Qemu-devel] [PATCH 0/7] coroutine: optimizations Ming Lei
  7 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 14:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

From: Peter Lieven <pl@kamp.de>

Placing coroutines on the global pool should be preferable, because it
can help all threads.  But if the global pool is full, we can still
try to save some allocations by stashing completed coroutines on the
local pool.  This is quite cheap too, because it does not require
atomic operations.

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 qemu-coroutine.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index da1b961..977f114 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -27,6 +27,7 @@ enum {
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
 
 static void coroutine_pool_cleanup(Notifier *n, void *value)
@@ -58,13 +59,14 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  * release_pool_size and the actual size of release_pool.  But
                  * it is just a heuristic, it does not need to be perfect.
                  */
-                release_pool_size = 0;
+                alloc_pool_size += atomic_xchg(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
             }
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+            alloc_pool_size--;
         }
     }
 
@@ -87,6 +89,11 @@ static void coroutine_delete(Coroutine *co)
             atomic_inc(&release_pool_size);
             return;
         }
+        if (alloc_pool_size < POOL_BATCH_SIZE) {
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
+            return;
+        }
     }
 
     qemu_coroutine_delete(co);
-- 
2.1.0


* Re: [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread Paolo Bonzini
@ 2014-11-28 14:28   ` Peter Maydell
  2014-11-28 14:45   ` Markus Armbruster
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Maydell @ 2014-11-28 14:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, ming.lei, Peter Lieven, QEMU Developers, Stefan Hajnoczi

On 28 November 2014 at 14:12, Paolo Bonzini <pbonzini@redhat.com> wrote:
> +/* This function is marked noinline to prevent GCC from inlining it
> + * into coroutine_trampoline(). If we allow it to do that then it
> + * hoists the code to get the address of the TLS variable "current"
> + * out of the while() loop. This is an invalid transformation because
> + * the SwitchToFiber() call may be called when running thread A but
> + * return in thread B, and so we might be in a different thread
> + * context each time round the loop.
> + */
>  CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
>                                        CoroutineAction action)

??? You've added the comment but the function is not marked
"noinline" at all...

-- PMM


* Re: [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread Paolo Bonzini
  2014-11-28 14:28   ` Peter Maydell
@ 2014-11-28 14:45   ` Markus Armbruster
  2014-11-28 15:36     ` Kevin Wolf
  1 sibling, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2014-11-28 14:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, ming.lei, pl, qemu-devel, stefanha

Paolo Bonzini <pbonzini@redhat.com> writes:

> ELF thread local storage is about 10% faster on tests/test-coroutine's
> perf/cost test.  The timing on my machine is 160ns per iteration with
> pthread TLS, 145 with ELF TLS.
>
> Based on a patch by Kevin Wolf and Peter Lieven, but redone to follow
> the model of coroutine-win32.c (including the important "noinline"
> attribute!!!).
>
> Platforms without thread-local storage (OpenBSD probably?) will need
> a new-enough GCC for this to compile, in order to use the same emutls
> support that Windows already relies on.
[...]
> @@ -193,15 +155,22 @@ void qemu_coroutine_delete(Coroutine *co_)
>      g_free(co);
>  }
>  
> +/* This function is marked noinline to prevent GCC from inlining it
> + * into coroutine_trampoline(). If we allow it to do that then it
> + * hoists the code to get the address of the TLS variable "current"
> + * out of the while() loop. This is an invalid transformation because
> + * the SwitchToFiber() call may be called when running thread A but
> + * return in thread B, and so we might be in a different thread
> + * context each time round the loop.
> + */
>  CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
>                                        CoroutineAction action)

Err, did you forget the actual __attribute__((noinline))?

>  {
>      CoroutineUContext *from = DO_UPCAST(CoroutineUContext, base, from_);
>      CoroutineUContext *to = DO_UPCAST(CoroutineUContext, base, to_);
> -    CoroutineThreadState *s = coroutine_get_thread_state();
>      int ret;
>  
> -    s->current = to_;
> +    current = to_;
>  
>      ret = sigsetjmp(from->env, 0);
>      if (ret == 0) {
[...]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread
  2014-11-28 14:45   ` Markus Armbruster
@ 2014-11-28 15:36     ` Kevin Wolf
  0 siblings, 0 replies; 26+ messages in thread
From: Kevin Wolf @ 2014-11-28 15:36 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Paolo Bonzini, ming.lei, pl, qemu-devel, stefanha

Am 28.11.2014 um 15:45 hat Markus Armbruster geschrieben:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
> > ELF thread local storage is about 10% faster on tests/test-coroutine's
> > perf/cost test.  The timing on my machine is 160ns per iteration with
> > pthread TLS, 145 with ELF TLS.
> >
> > Based on a patch by Kevin Wolf and Peter Lieven, but redone to follow
> > the model of coroutine-win32.c (including the important "noinline"
> > attribute!!!).
> >
> > Platforms without thread-local storage (OpenBSD probably?) will need
> > a new-enough GCC for this to compile, in order to use the same emutls
> > support that Windows already relies on.
> [...]
> > @@ -193,15 +155,22 @@ void qemu_coroutine_delete(Coroutine *co_)
> >      g_free(co);
> >  }
> >  
> > +/* This function is marked noinline to prevent GCC from inlining it
> > + * into coroutine_trampoline(). If we allow it to do that then it
> > + * hoists the code to get the address of the TLS variable "current"
> > + * out of the while() loop. This is an invalid transformation because
> > + * the SwitchToFiber() call may be called when running thread A but
> > + * return in thread B, and so we might be in a different thread
> > + * context each time round the loop.
> > + */
> >  CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
> >                                        CoroutineAction action)
> 
> Err, did you forget the actual __attribute__((noinline))?

The comment needs updating, too. There's no SwitchToFiber() in the
ucontext implementation.
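
For reference, a minimal sketch of what the missing annotation looks like,
assuming GCC or Clang (both accept __thread and the noinline attribute).
DemoCoroutine, demo_current, demo_switch and demo_self are hypothetical
stand-ins for illustration, not the names used in coroutine-ucontext.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's Coroutine type. */
typedef struct DemoCoroutine {
    int id;
} DemoCoroutine;

/* One "current coroutine" pointer per thread (GCC/Clang __thread). */
static __thread DemoCoroutine *demo_current;

/*
 * noinline forces a real call, so the address of demo_current is
 * computed inside this function every time it runs, instead of being
 * cached by a loop in the caller.
 */
static __attribute__((noinline)) DemoCoroutine *demo_switch(DemoCoroutine *to)
{
    DemoCoroutine *from = demo_current;

    assert(to != NULL);
    demo_current = to;
    return from;
}

/* Returns the coroutine recorded as current for the calling thread. */
static DemoCoroutine *demo_self(void)
{
    return demo_current;
}
```

The sketch only shows where the attribute goes; it does not reproduce the
context switch itself.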

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
@ 2014-11-28 16:40   ` Kevin Wolf
  2014-11-28 17:30     ` Paolo Bonzini
  2014-11-28 17:31     ` Paolo Bonzini
  0 siblings, 2 replies; 26+ messages in thread
From: Kevin Wolf @ 2014-11-28 16:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: ming.lei, pl, qemu-devel, stefanha

Am 28.11.2014 um 15:12 hat Paolo Bonzini geschrieben:
> I still believe we will end with some kind of coroutine bypass scheme
> (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv
> directly can help), but hey it cannot hurt to optimize hot code.

Not sure if speculations about the future belong into commit messages,
but while it may turn out that a bypass is required in the end (I hope
it doesn't), the part about AIOCBs is wrong if you really consistently
use coroutines all the way down from the device to the block driver.

I think Peter picked up all of my patches to actually handle requests
this way (i.e. virtio-blk already creates the coroutine).

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
  2014-11-28 16:40   ` Kevin Wolf
@ 2014-11-28 17:30     ` Paolo Bonzini
  2014-11-28 17:31     ` Paolo Bonzini
  1 sibling, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 17:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: ming.lei, pl, stefanha



On 28/11/2014 17:40, Kevin Wolf wrote:
>> > I still believe we will end with some kind of coroutine bypass scheme
>> > (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv
>> > directly can help), but hey it cannot hurt to optimize hot code.
>
> Not sure if speculations about the future belong into commit messages,
> but while it may turn out that a bypass is required in the end (I hope
> it doesn't), the part about AIOCBs is wrong if you really consistently
> use coroutines all the way down from the device to the block driver.

This is much harder from virtio-scsi than from virtio-blk, though.

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
  2014-11-28 16:40   ` Kevin Wolf
  2014-11-28 17:30     ` Paolo Bonzini
@ 2014-11-28 17:31     ` Paolo Bonzini
  2014-11-28 18:34       ` Kevin Wolf
  1 sibling, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 17:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: ming.lei, pl, qemu-devel, stefanha



On 28/11/2014 17:40, Kevin Wolf wrote:
>> > I still believe we will end with some kind of coroutine bypass scheme
>> > (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv
>> > directly can help), but hey it cannot hurt to optimize hot code.
>
> Not sure if speculations about the future belong into commit messages,
> but while it may turn out that a bypass is required in the end (I hope
> it doesn't), the part about AIOCBs is wrong if you really consistently
> use coroutines all the way down from the device to the block driver.

This is much harder for virtio-scsi than for virtio-blk, though.

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
  2014-11-28 17:31     ` Paolo Bonzini
@ 2014-11-28 18:34       ` Kevin Wolf
  2014-11-28 19:57         ` Paolo Bonzini
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2014-11-28 18:34 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: ming.lei, pl, qemu-devel, stefanha

Am 28.11.2014 um 18:31 hat Paolo Bonzini geschrieben:
> 
> 
> On 28/11/2014 17:40, Kevin Wolf wrote:
> >> > I still believe we will end with some kind of coroutine bypass scheme
> >> > (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv
> >> > directly can help), but hey it cannot hurt to optimize hot code.
> >
> > Not sure if speculations about the future belong into commit messages,
> > but while it may turn out that a bypass is required in the end (I hope
> > it doesn't), the part about AIOCBs is wrong if you really consistently
> > use coroutines all the way down from the device to the block driver.
> 
> This is much harder for virtio-scsi than for virtio-blk, though.

Why is that? At least replacing the bdrv_aio_*() call by
coroutine_create/coroutine_enter/bdrv_co_*() is a mechanical change that
shouldn't be any harder for virtio-scsi. Whether we can optimise even
more by integrating the device more closely with coroutines might be a different
problem, but at this point you've already got rid of AIOCBs.

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
  2014-11-28 18:34       ` Kevin Wolf
@ 2014-11-28 19:57         ` Paolo Bonzini
  0 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-28 19:57 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: ming.lei, pl, qemu-devel, stefanha



On 28/11/2014 19:34, Kevin Wolf wrote:
>>> > > Not sure if speculations about the future belong into commit messages,
>>> > > but while it may turn out that a bypass is required in the end (I hope
>>> > > it doesn't), the part about AIOCBs is wrong if you really consistently
>>> > > use coroutines all the way down from the device to the block driver.
>> > 
>> > This is much harder for virtio-scsi than for virtio-blk, though.
> Why is that? At least replacing the bdrv_aio_*() call by
> coroutine_create/coroutine_enter/bdrv_co_*() is a mechanical change that
> shouldn't be any harder for virtio-scsi. Whether we can optimise even
> more by integrating the device more with coroutines might be a different
> problem, but at this point you've already got rid of AIOCBs.

Because I/O is done by the generic SCSI code, so you'd have to modify
that and the DMA helpers.  And the generic SCSI code is itself written
asynchronously in order to support HBAs that use a bounce buffer.

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines Paolo Bonzini
@ 2014-11-28 20:52   ` Peter Lieven
  2014-11-29 14:27     ` Paolo Bonzini
  2014-11-29 14:28     ` Paolo Bonzini
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Lieven @ 2014-11-28 20:52 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, ming.lei, stefanha

Am 28.11.2014 um 15:12 schrieb Paolo Bonzini:
> From: Peter Lieven <pl@kamp.de>
>
> Placing coroutines on the global pool should be preferable, because it
> can help all threads.  But if the global pool is full, we can still
> try to save some allocations by stashing completed coroutines on the
> local pool.  This is quite cheap too, because it does not require
> atomic operations.

At least in test-coroutine.c this turns out to be more than just a nice-to-have.
I have not fully understood why, but I get the following results:

master:
Run operation 40000000 iterations 13.612604 s, 2938K operations/s, 340ns per coroutine

this series up to patch 6:
Run operation 40000000 iterations 10.428382 s, 3835K operations/s, 260ns per coroutine

this series up to patch 7:
Run operation 40000000 iterations 9.112539 s, 4389K operations/s, 227ns per coroutine

So this confirms the +33% Paolo sees up to Patch 5. But I have not yet fully understood the
+15% that this patch gains.

>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  qemu-coroutine.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index da1b961..977f114 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -27,6 +27,7 @@ enum {
>  static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
>  static unsigned int release_pool_size;
>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
> +static __thread unsigned int alloc_pool_size;
>  static __thread Notifier coroutine_pool_cleanup_notifier;
>  
>  static void coroutine_pool_cleanup(Notifier *n, void *value)
> @@ -58,13 +59,14 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>                   * release_pool_size and the actual size of release_pool.  But
>                   * it is just a heuristic, it does not need to be perfect.
>                   */
> -                release_pool_size = 0;
> +                alloc_pool_size += atomic_xchg(&release_pool_size, 0);

I had alloc_pool_size = in my original patch.
It shouldn't make a difference, since alloc_pool_size should be 0
when we reach this piece of code. But if for some reason release_pool_size
is inaccurate, we add this error to alloc_pool_size again and again,
and in the worst case we eventually stop adding coroutines to the
thread-local pool below although it might be empty.
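
The difference between "=" and "+=" can be shown with a small model. This
is a sketch under stated assumptions, not the code from qemu-coroutine.c:
DemoLocalPool, demo_refill, demo_drain and demo_drift are hypothetical
names, and the shared counter is assumed to over-report by a fixed
DEMO_ERROR on every refill.

```c
#include <assert.h>

enum {
    DEMO_BATCH = 10,  /* coroutines really moved per refill (assumed) */
    DEMO_ERROR = 3    /* how far the shared counter over-reports (assumed) */
};

typedef struct {
    unsigned int real_items;  /* coroutines actually on the local list */
    unsigned int counted;     /* local estimate, like alloc_pool_size */
} DemoLocalPool;

/* Take over the shared pool; "reported" is the possibly wrong counter. */
static void demo_refill(DemoLocalPool *p, unsigned int real,
                        unsigned int reported, int accumulate)
{
    p->real_items = real;
    if (accumulate) {
        p->counted += reported;   /* "+=": keeps any earlier error */
    } else {
        p->counted = reported;    /* "=": discards any earlier error */
    }
}

/* Allocate until the local list is really empty. */
static void demo_drain(DemoLocalPool *p)
{
    assert(p->counted >= p->real_items);
    while (p->real_items > 0) {
        p->real_items--;
        p->counted--;
    }
}

/* Returns the counter value left over while the list is truly empty. */
static unsigned int demo_drift(int accumulate, unsigned int rounds)
{
    DemoLocalPool p = { 0, 0 };
    unsigned int r;

    for (r = 0; r < rounds; r++) {
        demo_refill(&p, DEMO_BATCH, DEMO_BATCH + DEMO_ERROR, accumulate);
        demo_drain(&p);
    }
    return p.counted;
}
```

In this model the "+=" variant grows by DEMO_ERROR on every round, while
the "=" variant stays bounded by a single DEMO_ERROR.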

Peter

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines
  2014-11-28 20:52   ` Peter Lieven
@ 2014-11-29 14:27     ` Paolo Bonzini
  2014-11-29 21:28       ` Peter Lieven
  2014-11-29 14:28     ` Paolo Bonzini
  1 sibling, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-29 14:27 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: kwolf, ming.lei, stefanha



On 28/11/2014 21:52, Peter Lieven wrote:
> 
> master:
> Run operation 40000000 iterations 13.612604 s, 2938K operations/s, 340ns per coroutine
> 
> this series up to patch 6:
> Run operation 40000000 iterations 10.428382 s, 3835K operations/s, 260ns per coroutine
> 
> this series up to patch 7:
> Run operation 40000000 iterations 9.112539 s, 4389K operations/s, 227ns per coroutine
> 
> So this confirms the +33% Paolo sees up to Patch 5. But I have not yet fully understood the
> +15% that this Patch gains.

No atomic operations once the release pool gets full.  We're talking of
800 clock cycles here, and one atomic operation costs 50 cycles.  100
clock cycles out of 800 = 15% speedup (8/7 = 1.14).
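
The arithmetic can be checked with a short sketch. The cycle counts (800
total, 50 per atomic operation) come from the paragraph above; reading the
100 saved cycles as two atomic operations is an inference from those
numbers, and demo_speedup is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Speedup factor when the cost per coroutine drops from "before" to
 * "after" (same unit for both, e.g. clock cycles or nanoseconds).
 */
static double demo_speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}

/* Cycles left per coroutine after dropping n atomic operations. */
static double demo_cycles_after(double total, double per_atomic, int n)
{
    return total - per_atomic * n;
}
```

With these helpers, demo_speedup(800, demo_cycles_after(800, 50, 2)) is
800/700, about 1.14. The measured times quoted above give a similar
factor: 260ns down to 227ns is 260/227, about 1.145.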

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines
  2014-11-28 20:52   ` Peter Lieven
  2014-11-29 14:27     ` Paolo Bonzini
@ 2014-11-29 14:28     ` Paolo Bonzini
  1 sibling, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2014-11-29 14:28 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: kwolf, ming.lei, stefanha



On 28/11/2014 21:52, Peter Lieven wrote:
>> > +                alloc_pool_size += atomic_xchg(&release_pool_size, 0);
> I had alloc_pool_size = in my original Patch.
> It shouldn't make a difference, since alloc_pool_size should be 0
> when we reach this code piece. But if for some reason release_pool_size
> is inaccurate we add this error to alloc_pool_size again and again
> and eventually end up not adding coroutines to the thread local pool below
> although it might be empty in the worst case.

Oops, this must come from a rebase.  Thanks for pointing it out.

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines
  2014-11-29 14:27     ` Paolo Bonzini
@ 2014-11-29 21:28       ` Peter Lieven
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Lieven @ 2014-11-29 21:28 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, ming.lei, stefanha

Am 29.11.2014 um 15:27 schrieb Paolo Bonzini:
>
> On 28/11/2014 21:52, Peter Lieven wrote:
>> master:
>> Run operation 40000000 iterations 13.612604 s, 2938K operations/s, 340ns per coroutine
>>
>> this series up to patch 6:
>> Run operation 40000000 iterations 10.428382 s, 3835K operations/s, 260ns per coroutine
>>
>> this series up to patch 7:
>> Run operation 40000000 iterations 9.112539 s, 4389K operations/s, 227ns per coroutine
>>
>> So this confirms the +33% Paolo sees up to Patch 5. But I have not yet fully understood the
>> +15% that this Patch gains.
> No atomic operations once the release pool gets full.  We're talking of
> 800 clock cycles here, and one atomic operation costs 50 cycles.  100
> clock cycles out of 800 = 15% speedup (8/7 = 1.14).

Maybe it's worth mentioning (at least partly) in the commit message that this can give an
additional gain of 15% in the best case. That makes +50% for the whole series in the best case.

Peter

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems Paolo Bonzini
@ 2014-12-01  1:28   ` Ming Lei
  2014-12-01 12:41     ` Paolo Bonzini
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2014-12-01  1:28 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, pl, qemu-devel, Stefan Hajnoczi

On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> unsigned long is not large enough to represent 1000000000 * duration there.
> Just use floating point.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  tests/test-coroutine.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
> index e22fae1..27d1b6f 100644
> --- a/tests/test-coroutine.c
> +++ b/tests/test-coroutine.c
> @@ -337,7 +337,7 @@ static void perf_cost(void)
>                     "%luns per coroutine",
>                     maxcycles,
>                     duration, ops,
> -                   (unsigned long)(1000000000 * duration) / maxcycles);
> +                   (unsigned long)(1000000000.0 * duration / maxcycles));

One more single bracket.
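
To illustrate why the quoted change matters, here is a small sketch. It
assumes a 32-bit unsigned long (modelled with UINT32_MAX); the function
names are hypothetical. The old expression converted 1000000000 * duration
to unsigned long before dividing, and converting a double that is out of
range for the target integer type is undefined behaviour in C, so the
sketch only tests the range instead of performing that cast.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if 1e9 * duration_s does not fit in a 32-bit unsigned long. */
static int demo_product_overflows_u32(double duration_s)
{
    return 1000000000.0 * duration_s > (double)UINT32_MAX;
}

/*
 * Patched form: do the whole computation in floating point and convert
 * only the (small) per-operation result to an integer.
 */
static unsigned long demo_ns_per_op(double duration_s,
                                    unsigned long maxcycles)
{
    assert(maxcycles > 0);
    return (unsigned long)(1000000000.0 * duration_s / maxcycles);
}
```

Under that assumption, any run longer than about 4.29 seconds pushes the
intermediate product past the 32-bit limit, while the per-operation result
stays small.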

thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations
  2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
                   ` (6 preceding siblings ...)
  2014-11-28 14:12 ` [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines Paolo Bonzini
@ 2014-12-01  5:55 ` Ming Lei
  2014-12-01  7:05   ` Peter Lieven
  7 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2014-12-01  5:55 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, pl, qemu-devel, Stefan Hajnoczi

On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> As discussed in the other thread, this brings speedups from
> dropping the coroutine mutex (which serializes multiple iothreads,
> too) and using ELF thread-local storage.
>
> The speedup in perf/cost is about 30% (190->145).  Windows port tested
> with tests/test-coroutine.exe under Wine.

The data is very nice, and on my laptop 'perf cost' decreases
from 244ns to 174ns.

BTW, the cost of using a coroutine to run a function doesn't come only from these
helpers (*_yield, *_enter, *_create; perf-cost measures just
this part of the cost), but also from some implicit/invisible part. I have some
test cases which can show the problem. If someone is interested,
I can post them on the list.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations
  2014-12-01  5:55 ` [Qemu-devel] [PATCH 0/7] coroutine: optimizations Ming Lei
@ 2014-12-01  7:05   ` Peter Lieven
  2014-12-01  7:46     ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Lieven @ 2014-12-01  7:05 UTC (permalink / raw)
  To: Ming Lei, Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 01.12.2014 06:55, Ming Lei wrote:
> On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> As discussed in the other thread, this brings speedups from
>> dropping the coroutine mutex (which serializes multiple iothreads,
>> too) and using ELF thread-local storage.
>>
>> The speedup in perf/cost is about 30% (190->145).  Windows port tested
>> with tests/test-coroutine.exe under Wine.
> The data is very nice, and in my laptop, 'perf cost' can be decreased
> from 244ns to 174ns.
>
> BTW, the cost by using coroutine to run function isn't only from these
> helpers(*_yield, *_enter, *_create, and perf-cost just measures
> this part of cost), but also some implicit/invisible part. I have some
> test cases which can show the problem. If someone is interested,
> I can post them in list.

Of course, please post them. Maybe the problem can be solved or at least mitigated.

Peter

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations
  2014-12-01  7:05   ` Peter Lieven
@ 2014-12-01  7:46     ` Ming Lei
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2014-12-01  7:46 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On Mon, 01 Dec 2014 08:05:17 +0100
Peter Lieven <pl@kamp.de> wrote:

> On 01.12.2014 06:55, Ming Lei wrote:
> > On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> As discussed in the other thread, this brings speedups from
> >> dropping the coroutine mutex (which serializes multiple iothreads,
> >> too) and using ELF thread-local storage.
> >>
> >> The speedup in perf/cost is about 30% (190->145).  Windows port tested
> >> with tests/test-coroutine.exe under Wine.
> > The data is very nice, and in my laptop, 'perf cost' can be decreased
> > from 244ns to 174ns.
> >
> > BTW, the cost by using coroutine to run function isn't only from these
> > helpers(*_yield, *_enter, *_create, and perf-cost just measures
> > this part of cost), but also some implicit/invisible part. I have some
> > test cases which can show the problem. If someone is interested,
> > I can post them in list.
> 
> Of course, maybe the problem can be solved or impaired.

OK, please try below patch:

From 917d5cc0a273f9825b10abd52152c54e08c81ef8 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@canonical.com>
Date: Mon, 1 Dec 2014 11:11:23 +0800
Subject: [PATCH] test-coroutine: introduce perf-cost-with-load

The perf/cost test case only covers the explicit cost of
using a coroutine.

This patch provides an open/close file test case, and
from this case we can see that there is also some implicit
or invisible cost in addition to the cost measured by /perf/cost.

In my environment, the test results after applying this
patch and running perf/cost and perf/cost-with-load are as follows:

	{*LOG(start):{/perf/cost}:LOG*}
	/perf/cost: {*LOG(message):{Run operation 40000000 iterations 7.539413
	s, 5305K operations/s, 188ns per coroutine}:LOG*}
	OK
	{*LOG(stop):(0;0;7.539497):LOG*}

	{*LOG(start):{/perf/cost-with-load}:LOG*}
	/perf/cost-with-load: {*LOG(message):{Run operation 1000000 iterations
	2.648014 s, 377K operations/s, 2648ns per operation without using
	coroutine}:LOG*}
	{*LOG(message):{Run operation 1000000 iterations 2.919133 s, 342K
	operations/s, 2919ns per operation, 271ns(cost introduced by coroutine)
	per operation with using coroutine}:LOG*}
	OK
	{*LOG(stop):(0;0;5.567333):LOG*}

From the above data, we can see that 188ns is introduced for running one
coroutine, but in /perf/cost-with-load the actual cost introduced
is 271ns, so the extra 83ns cost is invisible and implicit.

Similar results can be found in the following test cases too:
	- read from /dev/nullb0 opened with O_DIRECT
	(a sort of aio read simulation; needs a 3.13+ kernel for
	/dev/nullbX support via 'modprobe null_blk'; this case
	shows +150ns extra cost)
	- statvfs() syscall: there is ~30ns extra cost for running
	one statvfs() with a coroutine
---
 tests/test-coroutine.c |   67 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index 27d1b6f..7323a91 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -311,6 +311,72 @@ static void perf_baseline(void)
         maxcycles, duration);
 }
 
+static void perf_cost_load_worker(void *opaque)
+{
+    int fd;
+
+    fd = open("/proc/self/exe", O_RDONLY);
+    assert(fd >= 0);
+    close(fd);
+}
+
+static __attribute__((noinline)) void perf_cost_load_func(void *opaque)
+{
+    perf_cost_load_worker(opaque);
+    qemu_coroutine_yield();
+}
+
+static double perf_cost_load(unsigned long maxcycles, bool use_co)
+{
+    unsigned long i = 0;
+    double duration;
+
+    g_test_timer_start();
+    if (use_co) {
+        Coroutine *co;
+        while (i++ < maxcycles) {
+            co = qemu_coroutine_create(perf_cost_load_func);
+            qemu_coroutine_enter(co, &i);
+            qemu_coroutine_enter(co, NULL);
+        }
+    } else {
+        while (i++ < maxcycles) {
+            perf_cost_load_worker(&i);
+        }
+    }
+    duration = g_test_timer_elapsed();
+
+    return duration;
+}
+
+static void perf_cost_with_load(void)
+{
+    const unsigned long maxcycles = 1000000;
+    double duration;
+    unsigned long ops;
+    unsigned long cost_co, cost;
+
+    duration = perf_cost_load(maxcycles, false);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation without using coroutine",
+                   maxcycles,
+                   duration, ops,
+                   cost);
+
+    duration = perf_cost_load(maxcycles, true);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost_co = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation, "
+                   "%luns(cost introduced by coroutine) per operation "
+                   "with using coroutine",
+                   maxcycles,
+                   duration, ops,
+                   cost_co, cost_co - cost);
+}
+
 static __attribute__((noinline)) void perf_cost_func(void *opaque)
 {
     qemu_coroutine_yield();
@@ -355,6 +421,7 @@ int main(int argc, char **argv)
         g_test_add_func("/perf/yield", perf_yield);
         g_test_add_func("/perf/function-call", perf_baseline);
         g_test_add_func("/perf/cost", perf_cost);
+        g_test_add_func("/perf/cost-with-load", perf_cost_with_load);
     }
     return g_test_run();
 }
-- 
1.7.9.5


Thanks,
-- 
Ming Lei

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems
  2014-12-01  1:28   ` Ming Lei
@ 2014-12-01 12:41     ` Paolo Bonzini
  2014-12-02  1:20       ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2014-12-01 12:41 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, pl, qemu-devel, Stefan Hajnoczi



On 01/12/2014 02:28, Ming Lei wrote:
>> > -                   (unsigned long)(1000000000 * duration) / maxcycles);
>> > +                   (unsigned long)(1000000000.0 * duration / maxcycles));
> One more single bracket.

I don't understand?

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems
  2014-12-01 12:41     ` Paolo Bonzini
@ 2014-12-02  1:20       ` Ming Lei
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2014-12-02  1:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, pl, qemu-devel, Stefan Hajnoczi

On Mon, Dec 1, 2014 at 8:41 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 01/12/2014 02:28, Ming Lei wrote:
>>> > -                   (unsigned long)(1000000000 * duration) / maxcycles);
>>> > +                   (unsigned long)(1000000000.0 * duration / maxcycles));
>> One more single bracket.
>
> I don't understand?

Sorry, it is my fault, :-(

Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-12-02  1:20 UTC | newest]

Thread overview: 26+ messages
2014-11-28 14:12 [Qemu-devel] [PATCH 0/7] coroutine: optimizations Paolo Bonzini
2014-11-28 14:12 ` [Qemu-devel] [PATCH 1/7] coroutine-ucontext: use __thread Paolo Bonzini
2014-11-28 14:28   ` Peter Maydell
2014-11-28 14:45   ` Markus Armbruster
2014-11-28 15:36     ` Kevin Wolf
2014-11-28 14:12 ` [Qemu-devel] [PATCH 2/7] qemu-thread: add per-thread atexit functions Paolo Bonzini
2014-11-28 14:12 ` [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems Paolo Bonzini
2014-12-01  1:28   ` Ming Lei
2014-12-01 12:41     ` Paolo Bonzini
2014-12-02  1:20       ` Ming Lei
2014-11-28 14:12 ` [Qemu-devel] [PATCH 4/7] QSLIST: add lock-free operations Paolo Bonzini
2014-11-28 14:12 ` [Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
2014-11-28 16:40   ` Kevin Wolf
2014-11-28 17:30     ` Paolo Bonzini
2014-11-28 17:31     ` Paolo Bonzini
2014-11-28 18:34       ` Kevin Wolf
2014-11-28 19:57         ` Paolo Bonzini
2014-11-28 14:12 ` [Qemu-devel] [PATCH 6/7] coroutine: drop qemu_coroutine_adjust_pool_size Paolo Bonzini
2014-11-28 14:12 ` [Qemu-devel] [PATCH 7/7] coroutine: try harder not to delete coroutines Paolo Bonzini
2014-11-28 20:52   ` Peter Lieven
2014-11-29 14:27     ` Paolo Bonzini
2014-11-29 21:28       ` Peter Lieven
2014-11-29 14:28     ` Paolo Bonzini
2014-12-01  5:55 ` [Qemu-devel] [PATCH 0/7] coroutine: optimizations Ming Lei
2014-12-01  7:05   ` Peter Lieven
2014-12-01  7:46     ` Ming Lei
