* [PATCH 0/2] thread-pool: fix performance regression
From: Paolo Bonzini @ 2022-05-06 11:47 UTC
  To: qemu-devel; +Cc: stefanha, ldoktor

Together, these two patches fix the performance regression introduced by
QemuSemaphore; individually, neither of them is sufficient.

6.2:
   iops        : min=58051, max=62260, avg=60282.57, stdev=1081.18, samples=30
    clat percentiles (usec):   1.00th=[  490],   99.99th=[  775]
   iops        : min=59401, max=61290, avg=60651.27, stdev=468.24, samples=30
    clat percentiles (usec):   1.00th=[  490],   99.99th=[  717]
   iops        : min=59583, max=60816, avg=60353.43, stdev=282.69, samples=30
    clat percentiles (usec):   1.00th=[  490],   99.99th=[  701]
   iops        : min=58099, max=60713, avg=59739.53, stdev=755.49, samples=30
    clat percentiles (usec):   1.00th=[  494],   99.99th=[  717]


patched:
   iops        : min=60616, max=62522, avg=61654.37, stdev=555.67, samples=30
    clat percentiles (usec):   1.00th=[  474],   99.99th=[ 1303]
   iops        : min=61841, max=63600, avg=62878.47, stdev=442.40, samples=30
    clat percentiles (usec):   1.00th=[  465],   99.99th=[  685]
   iops        : min=62976, max=63910, avg=63531.60, stdev=261.05, samples=30
    clat percentiles (usec):   1.00th=[  461],   99.99th=[  693]
   iops        : min=60803, max=63623, avg=62653.37, stdev=808.76, samples=30
    clat percentiles (usec):   1.00th=[  465],   99.99th=[  685]

Paolo

Supersedes: <20220505131346.823941-1-pbonzini@redhat.com>

Paolo Bonzini (2):
  thread-pool: optimize scheduling of completion bottom half
  thread-pool: replace semaphore with condition variable

 util/thread-pool.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

-- 
2.31.1




* [PATCH 1/2] thread-pool: optimize scheduling of completion bottom half
From: Paolo Bonzini @ 2022-05-06 11:47 UTC
  To: qemu-devel; +Cc: stefanha, ldoktor

The completion bottom half was scheduled inside the pool->lock
critical section.  That actually results in worse performance,
because the worker thread can run its own small critical section
and go to sleep before the bottom half even starts running.
Schedule the bottom half before re-acquiring the lock instead, so
that the completion handler gets a chance to run concurrently.

Note that this simple change does not produce an improvement without
changing the thread pool QemuSemaphore to a condition variable.
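
As an illustration of the reordering, here is a minimal sketch in plain
pthreads rather than QEMU primitives; Pool and notify_completion() are
illustrative stand-ins, not the actual thread-pool code:

    #include <pthread.h>

    typedef struct Pool {
        pthread_mutex_t lock;
        /* ... request queue, counters ... */
    } Pool;

    /* Stand-in for qemu_bh_schedule(pool->completion_bh). */
    static void notify_completion(Pool *pool)
    {
        (void)pool;
    }

    /* Before: the completion is signalled from inside the critical
     * section, so the worker may finish that critical section and go
     * to sleep before the completion handler starts running. */
    static void finish_request_old(Pool *pool)
    {
        pthread_mutex_lock(&pool->lock);
        notify_completion(pool);
        /* ... pick up the next request or go to sleep ... */
        pthread_mutex_unlock(&pool->lock);
    }

    /* After: notify first; the completion handler can start running
     * while the worker re-acquires the lock. */
    static void finish_request_new(Pool *pool)
    {
        notify_completion(pool);
        pthread_mutex_lock(&pool->lock);
        /* ... pick up the next request or go to sleep ... */
        pthread_mutex_unlock(&pool->lock);
    }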

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/thread-pool.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/util/thread-pool.c b/util/thread-pool.c
index d763cea505..7e9e2c178b 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -108,9 +108,8 @@ static void *worker_thread(void *opaque)
         smp_wmb();
         req->state = THREAD_DONE;
 
-        qemu_mutex_lock(&pool->lock);
-
         qemu_bh_schedule(pool->completion_bh);
+        qemu_mutex_lock(&pool->lock);
     }
 
     pool->cur_threads--;
-- 
2.31.1





* [PATCH 2/2] thread-pool: replace semaphore with condition variable
From: Paolo Bonzini @ 2022-05-06 11:47 UTC
  To: qemu-devel; +Cc: stefanha, ldoktor

Since commit f9fc8932b1 ("thread-posix: remove the posix semaphore
support", 2022-04-06), QemuSemaphore has its own mutex and condition
variable; this adds unnecessary overhead on I/O with small block sizes.
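
To see where that overhead comes from, this is roughly the shape of
QemuSemaphore after f9fc8932b1 (a sketch based on the description above,
not the exact QEMU source):

    struct QemuSemaphore {
        QemuMutex mutex;          /* a second mutex besides pool->lock */
        QemuCond cond;
        unsigned int count;
    };

    /* Every post/wait pair pays for this extra lock/unlock and signal
     * on top of the pool->lock traffic. */
    void qemu_sem_post(QemuSemaphore *sem)
    {
        qemu_mutex_lock(&sem->mutex);
        sem->count++;
        qemu_cond_signal(&sem->cond);
        qemu_mutex_unlock(&sem->mutex);
    }

Waiting directly on a condition variable tied to pool->lock removes that
second mutex from the submission and wakeup path.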

Check the QTAILQ directly instead of going through the indirection of a
semaphore's count.  Using a semaphore has not been necessary since
qemu_cond_timedwait was introduced; the new code has to be careful about
spurious wakeups, but it is simpler: for example, thread_pool_cancel no
longer has to worry about keeping the semaphore count in sync with the
number of elements in pool->request_list.

Note that the return value of qemu_cond_timedwait (0 for timeout, 1 for
signal or spurious wakeup) is different from that of qemu_sem_timedwait
(-1 for timeout, 0 for success).
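
A side-by-side sketch of the two wait styles, simplified from the loop
in this patch (the field names come from the diff below; the exact loop
structure there differs slightly):

    /* Semaphore: ret == 0 means a token was consumed, i.e. a request
     * really was posted; -1 means the 10-second timeout expired. */
    qemu_mutex_unlock(&pool->lock);
    ret = qemu_sem_timedwait(&pool->sem, 10000);
    qemu_mutex_lock(&pool->lock);

    /* Condition variable: a nonzero return only means "woken up", so
     * the predicate must be rechecked in a loop to tolerate spurious
     * wakeups; zero means the 10-second timeout expired with no work. */
    while (QTAILQ_EMPTY(&pool->request_list)) {
        if (!qemu_cond_timedwait(&pool->request_cond, &pool->lock, 10000)) {
            break;   /* idle timeout: let this worker thread exit */
        }
    }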

Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/thread-pool.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/util/thread-pool.c b/util/thread-pool.c
index 7e9e2c178b..5511706de8 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -57,7 +57,7 @@ struct ThreadPool {
     QEMUBH *completion_bh;
     QemuMutex lock;
     QemuCond worker_stopped;
-    QemuSemaphore sem;
+    QemuCond request_cond;
     int max_threads;
     QEMUBH *new_thread_bh;
 
@@ -85,15 +85,14 @@ static void *worker_thread(void *opaque)
         ThreadPoolElement *req;
         int ret;
 
-        do {
+        if (QTAILQ_EMPTY(&pool->request_list)) {
             pool->idle_threads++;
-            qemu_mutex_unlock(&pool->lock);
-            ret = qemu_sem_timedwait(&pool->sem, 10000);
-            qemu_mutex_lock(&pool->lock);
+            ret = qemu_cond_timedwait(&pool->request_cond, &pool->lock, 10000);
             pool->idle_threads--;
-        } while (ret == -1 && !QTAILQ_EMPTY(&pool->request_list));
-        if (ret == -1 || pool->stopping) {
-            break;
+            if (!ret && QTAILQ_EMPTY(&pool->request_list)) {
+                break;
+            }
+            continue;
         }
 
         req = QTAILQ_FIRST(&pool->request_list);
@@ -210,13 +209,7 @@ static void thread_pool_cancel(BlockAIOCB *acb)
     trace_thread_pool_cancel(elem, elem->common.opaque);
 
     QEMU_LOCK_GUARD(&pool->lock);
-    if (elem->state == THREAD_QUEUED &&
-        /* No thread has yet started working on elem. we can try to "steal"
-         * the item from the worker if we can get a signal from the
-         * semaphore.  Because this is non-blocking, we can do it with
-         * the lock taken and ensure that elem will remain THREAD_QUEUED.
-         */
-        qemu_sem_timedwait(&pool->sem, 0) == 0) {
+    if (elem->state == THREAD_QUEUED) {
         QTAILQ_REMOVE(&pool->request_list, elem, reqs);
         qemu_bh_schedule(pool->completion_bh);
 
@@ -261,7 +254,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
     }
     QTAILQ_INSERT_TAIL(&pool->request_list, req, reqs);
     qemu_mutex_unlock(&pool->lock);
-    qemu_sem_post(&pool->sem);
+    qemu_cond_signal(&pool->request_cond);
     return &req->common;
 }
 
@@ -304,7 +297,7 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
     pool->completion_bh = aio_bh_new(ctx, thread_pool_completion_bh, pool);
     qemu_mutex_init(&pool->lock);
     qemu_cond_init(&pool->worker_stopped);
-    qemu_sem_init(&pool->sem, 0);
+    qemu_cond_init(&pool->request_cond);
     pool->max_threads = 64;
     pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
 
@@ -336,15 +329,15 @@ void thread_pool_free(ThreadPool *pool)
 
     /* Wait for worker threads to terminate */
     pool->stopping = true;
+    qemu_cond_broadcast(&pool->request_cond);
     while (pool->cur_threads > 0) {
-        qemu_sem_post(&pool->sem);
         qemu_cond_wait(&pool->worker_stopped, &pool->lock);
     }
 
     qemu_mutex_unlock(&pool->lock);
 
     qemu_bh_delete(pool->completion_bh);
-    qemu_sem_destroy(&pool->sem);
+    qemu_cond_destroy(&pool->request_cond);
     qemu_cond_destroy(&pool->worker_stopped);
     qemu_mutex_destroy(&pool->lock);
     g_free(pool);
-- 
2.31.1




* Re: [PATCH 0/2] thread-pool: fix performance regression
From: Lukáš Doktor @ 2022-05-06 18:55 UTC
  To: Paolo Bonzini, qemu-devel; +Cc: stefanha



Hello Paolo, folks, I gave it a try (on top of f9fc8932) and it is better than f9fc8932 alone and better than Stefan's previous patch, but I am still not reaching the performance of d7482ffe97 (the commit before f9fc8932):

f9f    |  0.0 | -2.8 |  0.6
stefan | -3.1 | -1.2 | -2.2
paolo  |  5.3 |  5.4 |  7.1
d74    |  7.2 |  9.1 |  8.2

Anyway, it is definitely closer to the previous baseline (~-2%).  Note that I have not tried other scenarios, just the 4K NBD writes on a rotational disk.  I'll run more tests throughout the night.

Regards,
Lukáš

On 5/6/22 13:47, Paolo Bonzini wrote:
> Together, these two patches fix the performance regression introduced by
> QemuSemaphore; individually, neither of them is sufficient.
> [...]



* Re: [PATCH 0/2] thread-pool: fix performance regression
From: Lukáš Doktor @ 2022-05-09  6:42 UTC
  To: Paolo Bonzini, qemu-devel; +Cc: stefanha



On 5/6/22 20:55, Lukáš Doktor wrote:
> Hello Paolo, folks, I gave it a try (on top of f9fc8932) and it is better than f9fc8932 alone and better than Stefan's previous patch, but I am still not reaching the performance of d7482ffe97 (the commit before f9fc8932):
> [...]

I tried a couple of iterations of fio-nbd 4/64/256 KiB reads and writes on a rotational disk, and overall the latest fix results in a steady 2.5% throughput regression for the 4 KiB writes.  The remaining tested variants performed similarly.  Please let me know if you want me to test fio execution inside the guest as well, or some other variants.

Regards,
Lukáš



* Re: [PATCH 0/2] thread-pool: fix performance regression
From: Paolo Bonzini @ 2022-05-09 10:07 UTC
  To: Lukáš Doktor, qemu-devel; +Cc: stefanha

On 5/9/22 08:42, Lukáš Doktor wrote:
> On 5/6/22 20:55, Lukáš Doktor wrote:
>> [...]
> 
> I tried a couple of iterations of fio-nbd 4/64/256 KiB reads and writes
> on a rotational disk, and overall the latest fix results in a steady
> 2.5% throughput regression for the 4 KiB writes.  The remaining tested
> variants performed similarly.  Please let me know if you want me to
> test fio execution inside the guest as well, or some other variants.

Considering that we have conflicting results (I get a 2-3% improvement
over 6.2), and that in general aio=native/aio=io_uring is preferred
anyway, I think we can proceed with these patches, at least for now.

Paolo


